Felix Pinkston
Jan 21, 2026 21:57
NVIDIA simplifies GPU development with a single-call CUB API in CUDA 13.1, eliminating repetitive two-phase memory-allocation code without performance loss.
NVIDIA has shipped a significant quality-of-life upgrade for GPU developers with CUDA 13.1, introducing a single-call API for the CUB template library that eliminates the clunky two-phase memory-allocation pattern developers have worked around for years.
The change addresses a long-standing pain point. CUB, the C++ template library powering high-performance GPU primitives like scans, sorts, and histograms, previously required developers to call each function twice: once to calculate the required temporary storage, then again to actually run the algorithm. Every CUB operation was a verbose dance of memory estimation, allocation, and execution, something like this:
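A minimal sketch of the classic pattern, using the long-standing cub::DeviceReduce::Sum signature (the helper name sum_two_phase is illustrative):

```cpp
#include <cub/cub.cuh>

// The classic two-phase pattern: call once with a null workspace pointer to
// query the temporary-storage size, allocate, then call again to execute.
void sum_two_phase(const int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Phase 1: writes the required workspace size, does no work.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);

    // Allocate the workspace, then run the algorithm for real.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                           d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}
```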
PyTorch’s codebase tells the story. The framework wraps CUB calls in macros specifically to hide this two-step invocation, a workaround common across production codebases. Macros obscure control flow and complicate debugging, a trade-off teams accepted because the alternative was worse.
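As an illustration only, such a wrapper typically hides the query-allocate-run sequence behind one invocation. This simplified stand-in (CUB_CALL) is not PyTorch's actual macro:

```cpp
#include <cub/cub.cuh>

// Illustrative macro in the spirit of production wrappers: query the
// workspace size, allocate, execute, free. Real codebases typically route
// the allocation through a caching allocator instead of raw cudaMalloc.
#define CUB_CALL(func, ...)                                  \
  do {                                                       \
    size_t temp_storage_bytes = 0;                           \
    func(nullptr, temp_storage_bytes, __VA_ARGS__);          \
    void* d_temp_storage = nullptr;                          \
    cudaMalloc(&d_temp_storage, temp_storage_bytes);         \
    func(d_temp_storage, temp_storage_bytes, __VA_ARGS__);   \
    cudaFree(d_temp_storage);                                \
  } while (0)

// Usage: CUB_CALL(cub::DeviceReduce::Sum, d_in, d_out, num_items);
```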
Zero Overhead, Less Code
The new API cuts straight to the point. What previously required explicit memory management now fits in a single line, with CUB handling temporary storage internally. NVIDIA’s benchmarks show the streamlined interface introduces zero performance overhead compared with the manual approach; memory allocation still happens, just under the hood, via asynchronous allocation embedded within the device primitives.
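Based on the article's description, the single-call overload simply drops the workspace parameters; the exact overload set should be confirmed against the CUDA 13.1 CUB documentation:

```cpp
#include <cub/cub.cuh>

// Single-call form: CUB sizes and allocates the temporary storage
// internally via stream-ordered asynchronous allocation, so the two
// workspace parameters disappear.
void sum_single_call(const int* d_in, int* d_out, int num_items)
{
    cub::DeviceReduce::Sum(d_in, d_out, num_items);
}
```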
Critically, the old two-phase API remains available. Developers who need fine-grained control over memory, such as reusing allocations across multiple operations or sharing them between algorithms, can continue using the existing pattern, as in the sketch below. But for the majority of use cases, the single-call approach should become the default.
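One reason to keep the two-phase form: a single workspace, sized for the larger of two algorithms, can serve both calls. This uses the established CUB signatures; the helper sum_and_max is illustrative:

```cpp
#include <algorithm>
#include <cub/cub.cuh>

// Fine-grained control the two-phase API preserves: size one workspace for
// the larger of two algorithms and reuse it across both, avoiding a second
// allocation.
void sum_and_max(const int* d_in, int* d_sum, int* d_max, int num_items)
{
    size_t bytes_sum = 0, bytes_max = 0;
    cub::DeviceReduce::Sum(nullptr, bytes_sum, d_in, d_sum, num_items);
    cub::DeviceReduce::Max(nullptr, bytes_max, d_in, d_max, num_items);

    void* d_temp = nullptr;
    cudaMalloc(&d_temp, std::max(bytes_sum, bytes_max));

    cub::DeviceReduce::Sum(d_temp, bytes_sum, d_in, d_sum, num_items);
    cub::DeviceReduce::Max(d_temp, bytes_max, d_in, d_max, num_items);
    cudaFree(d_temp);
}
```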
The Environment Argument
Beyond simplifying basic calls, CUDA 13.1 introduces an extensible “env” argument that consolidates execution configuration. Developers can now combine custom CUDA streams, memory resources, determinism requirements, and tuning policies through a single type-safe object rather than juggling multiple function parameters.
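A sketch under stated assumptions: the identifiers cuda::execution::require and cuda::execution::determinism::run_to_run, and the header below, are inferred from the article's description of the env mechanism and may not match the shipped CCCL spellings exactly:

```cpp
#include <cub/cub.cuh>
// Assumed header for the execution-environment utilities:
#include <cuda/execution>

void deterministic_sum(const float* d_in, float* d_out, int num_items)
{
    // Bundle a determinism requirement into a single type-safe env object.
    // Per the article, custom streams, memory resources, and tuning
    // policies can be combined into this same argument.
    auto env = cuda::execution::require(
        cuda::execution::determinism::run_to_run);

    cub::DeviceReduce::Sum(d_in, d_out, num_items, env);
}
```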
Memory resources, a new utility for allocation and deallocation, can be passed through this environment argument. NVIDIA provides default resources, but developers can substitute their own custom implementations or use CCCL-provided options like device memory pools.
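A hypothetical sketch of routing allocation through a caller-supplied resource; pool_resource and with_memory_resource are placeholder names standing in for the CCCL utilities the article mentions, not confirmed identifiers:

```cpp
// Placeholder names ("pool_resource", "with_memory_resource") stand in for
// the real CCCL memory-resource utilities; consult the CCCL docs for the
// shipped identifiers.
pool_resource pool;                                      // hypothetical device memory pool
auto env = cuda::execution::with_memory_resource(pool);  // hypothetical env builder
cub::DeviceReduce::Sum(d_in, d_out, num_items, env);     // allocation now routed through pool
```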
Currently, the environment interface supports core algorithms including the DeviceReduce operations (Reduce, Sum, Min, Max, ArgMin, ArgMax) and the DeviceScan operations (ExclusiveSum, ExclusiveScan). NVIDIA is tracking additional algorithm support through its CCCL GitHub repository.
Practical Implications
For teams maintaining GPU-accelerated applications, this update means less wrapper code and cleaner integration. The CUB library already serves as a foundational component of NVIDIA’s CUDA Core Compute Libraries (CCCL), and simplifying its API reduces friction for developers building custom CUDA kernels.
The timing aligns with a broader industry movement toward more accessible GPU programming. As AI workloads drive demand for optimized GPU code, lowering the barriers to using high-performance primitives matters.
CUDA 13.1 is available now through NVIDIA’s developer portal. Teams currently using macro wrappers around CUB calls should evaluate migrating to the native single-call API; it delivers the same abstraction without the debugging headaches.
Image source: Shutterstock

