
v4.0 [Mar 29, 2011]
Share GPUs across multiple threads
Use all GPUs in the system concurrently from a single host thread
No-copy pinning of system memory, a faster alternative to cudaMallocHost()
C++ new/delete and support for virtual functions
Support for inline PTX assembly
Thrust library of templated performance primitives such as sort, reduce, etc.
NVIDIA Performance Primitives (NPP) library for image/video processing
Layered Textures for working with same size/format textures at larger sizes and higher performance
Unified Virtual Addressing
GPUDirect v2.0 support for Peer-to-Peer Communication
Automated Performance Analysis in Visual Profiler
C++ debugging in CUDA-GDB for Linux and MacOS
GPU binary disassembler for Fermi architecture (cuobjdump)
Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features