Rongchai Wang
Jun 08, 2026 18:38
NVIDIA’s NVFP4 permits 4-bit precision coaching on Blackwell GPUs, delivering as much as 73% sooner throughput for Llama fashions with out accuracy loss.

NVIDIA has unveiled its new NVFP4 mixed-precision format, designed to speed up large-scale mannequin coaching on its Grace Blackwell GPUs. By leveraging 4-bit precision, NVFP4 delivers important efficiency beneficial properties for duties like pretraining massive language fashions (LLMs), providing as much as 73% sooner throughput in comparison with the FP8 baseline, based on information launched on June 8, 2026. These developments permit AI groups to coach bigger fashions in much less time, with no measurable accuracy trade-offs.
JAX, a high-performance library common for machine studying workflows, performs a central position on this breakthrough. NVIDIA built-in NVFP4 into its TransformerEngine and MaxText frameworks, enabling scalable LLM pretraining on Blackwell {hardware}. Max Xu, the creator of the announcement, highlighted that NVFP4 can deal with the trillions of tokens and 1000’s of accelerators concerned in fashionable AI coaching with unprecedented effectivity.
How NVFP4 Speeds Up Coaching
The NVFP4 format employs revolutionary methods to protect accuracy whereas pushing precision boundaries. Key options embrace:
- Micro block scaling: Smaller 16-element blocks scale back errors attributable to outlier values.
- Random Hadamard Remodel: Gaussianizes weight gradients to attenuate noise throughout quantization.
- 2D weight scaling: Ensures constant values throughout transposed gradients and ahead propagation.
- Stochastic rounding: Prevents small updates from being misplaced because of rounding errors.
These methods are notably impactful in feed-forward layers of transformers—the computational bottleneck in most LLMs—the place NVFP4 replaces FP8 precision. GEMM operations (basic matrix multiplications) in these layers are quantized to NVFP4, considerably decreasing computational overhead, whereas sustaining increased precision for consideration mechanisms to mitigate quantization noise.
Efficiency Positive factors
Benchmarks utilizing the Llama 3 sequence fashions illustrate NVFP4’s effectivity. For Llama 3.1 (405 billion parameters), coaching on NVIDIA’s GB300 Grace Blackwell Extremely Superchip achieved a 1.73x speedup versus FP8. Per-GPU throughput jumped from 2,103 TFLOPs (FP8) to three,633 TFLOPs (NVFP4), underscoring the format’s capacity to maximise {hardware} utilization.
NVIDIA additionally demonstrated that these beneficial properties come with out accuracy loss. Coaching loss for Llama 3 8B fashions adopted almost similar curves throughout 10,000 steps, with a negligible distinction of 0.026 nats in converged outcomes. This stability makes NVFP4 a compelling choice for production-scale AI methods, the place price and time financial savings are vital.
Why It Issues for the AI Ecosystem
JAX, already favored for its scalability and just-in-time (JIT) compilation, advantages considerably from NVFP4 integration. NVIDIA’s launch aligns with broader traits within the AI coaching ecosystem, the place effectivity per GPU hour is more and more prioritized. For instance, earlier in 2026, NVIDIA reported long-context coaching speedups for JAX workloads utilizing NVSHMEM inside XLA, and new JAX-based libraries like jNO are increasing its functions in neural operators and basis mannequin coaching.
The NVFP4 replace positions NVIDIA and JAX to stay aggressive in opposition to alternate options like PyTorch or customized options, reminiscent of xAI’s proprietary C-based stack, which just lately claimed increased GPU effectivity. As AI analysis budgets develop however stay finite, improvements like NVFP4 will possible drive adoption of frameworks and {hardware} that maximize return on compute funding.
Getting Began
The NVFP4 coaching recipe is obtainable by the MaxText framework on the NVIDIA JAX Toolbox GitHub repository. Builders can experiment with two NVFP4 modes—one with Random Hadamard Remodel (RHT) for improved convergence and one other with out RHT for minimal overhead. The general public NVIDIA MaxText container ghcr.io/nvidia/jax:maxtext consists of all mandatory libraries to start coaching on Blackwell GPUs.
For groups exploring cost-efficient massive mannequin coaching, NVFP4 provides a sturdy resolution. By optimizing throughput with out sacrificing mannequin high quality, NVIDIA and JAX proceed to solidify their place within the ever-demanding world of AI infrastructure.
Picture supply: Shutterstock
