
The NVIDIA CUDA Toolkit provides a development environment for creating high-performance GPU-accelerated applications. With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, and a runtime library to deploy your application.
v4.0 [Mar 29, 2011]
Share GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
No-copy pinning of system memory, a faster alternative to cudaMallocHost()
C++ new/delete and support for virtual functions
Support for inline PTX assembly
Thrust library of templated performance primitives such as sort, reduce, etc.
NVIDIA Performance Primitives (NPP) library for image/video processing
Layered Textures for working with same size/format textures at larger sizes and higher performance
Unified Virtual Addressing
GPUDirect v2.0 support for Peer-to-Peer Communication
Automated Performance Analysis in Visual Profiler
C++ debugging in CUDA-GDB for Linux and MacOS
GPU binary disassembler for Fermi architecture (cuobjdump)
Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features
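
As an illustration of the no-copy pinning item in the list above, the following is a minimal sketch of how an existing malloc() buffer might be pinned in place with cudaHostRegister() rather than allocated through cudaMallocHost(). It assumes a reasonably recent toolkit and a single CUDA-capable GPU. The kernel name scale, the CHECK macro, and the buffer size are hypothetical and chosen only for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            std::fprintf(stderr, "%s failed: %s\n", #call,        \
                         cudaGetErrorString(err_));               \
            std::exit(EXIT_FAILURE);                              \
        }                                                         \
    } while (0)

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ordinary pageable host memory, as an application might already own.
    float *host = static_cast<float *>(std::malloc(bytes));
    if (host == NULL) return EXIT_FAILURE;
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Pin the existing buffer in place; no second allocation or copy is made.
    CHECK(cudaHostRegister(host, bytes, cudaHostRegisterDefault));

    float *dev = NULL;
    CHECK(cudaMalloc(&dev, bytes));

    // Transfers to and from the pinned buffer, issued on the default stream.
    CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, 0));
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, 0));
    CHECK(cudaDeviceSynchronize());

    std::printf("host[0] = %.1f\n", host[0]);

    // Unpin before handing the memory back to the C allocator.
    CHECK(cudaFree(dev));
    CHECK(cudaHostUnregister(host));
    std::free(host);
    return 0;
}
```

The sketch can be built with nvcc (for example, nvcc pin.cu -o pin, where the file name is arbitrary). Registering a buffer this way avoids copying data into a separate pinned staging area, which is the main appeal when the memory is owned by code that cannot be changed to call cudaMallocHost().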