FlashAttention-4 Hits 1,605 TFLOPS on NVIDIA Blackwell GPUs

NVIDIA has launched FlashAttention-4, the latest optimization for transformer neural networks, which squeezes 1,605 TFLOPS out of its Blackwell architecture and captures 71% of the hardware's theoretical maximum performance.
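As a quick sanity check, the two headline figures can be combined to back out the theoretical peak they imply. The short Python sketch below uses only the numbers quoted above; the variable names are illustrative, and the result is an implied figure, not an official specification.

```python
# Back-of-the-envelope check using only the figures quoted in the article.
achieved_tflops = 1605.0   # reported FlashAttention-4 throughput
utilization = 0.71         # reported fraction of theoretical peak

# If achieved = utilization * peak, then peak = achieved / utilization.
implied_peak_tflops = achieved_tflops / utilization

print(f"Implied theoretical peak: {implied_peak_tflops:.0f} TFLOPS")
```

The implied peak works out to a little under 2,300 TFLOPS.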

The announcement matters for anyone watching AI infrastructure investments. As large language models push toward longer context windows, the attention mechanism's quadratic memory complexity becomes a brutal bottleneck. FlashAttention-4 attacks this problem directly, and the benchmark numbers suggest meaningful gains for production AI workloads.

What the Numbers Show

On the B200 GPU, FA4 delivers a 3.6x speedup over FlashAttention-2 on forward passes at a sequence length of 32,768. The backward pass runs 3.15x faster than FA2 under the same conditions. Against existing frameworks, FA4 posts a 1.3x improvement over cuDNN and 2.4x over Triton Inference Server implementations.

The memory efficiency gains are equally significant. Standard attention scales at O(N²) with sequence length, meaning that doubling your context window quadruples memory requirements. FA4 brings this down to O(N) through tiling and incremental softmax normalization. NVIDIA claims 20x lower memory usage compared to PyTorch baselines.
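To make the idea of tiling with incremental softmax normalization concrete, here is a minimal pure-Python sketch for a single query vector. It is not FA4's kernel, and the function names, toy data, and block size are all illustrative assumptions. It only shows why the full row of scores never has to be held at once: a running maximum, a running sum, and a running weighted accumulator are enough.

```python
import math

def naive_attention_row(q, keys, values, scale):
    """Reference version: materialize every score before normalizing."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention_row(q, keys, values, scale, block_size=2):
    """Visit keys/values one block at a time, keeping a running softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

# Toy data (hypothetical): one query, five keys and values of dimension 2.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
scale = 1.0 / math.sqrt(len(q))

naive = naive_attention_row(q, keys, values, scale)
tiled = tiled_attention_row(q, keys, values, scale)
print(naive)
print(tiled)
```

Both functions return the same output up to floating-point rounding, but the tiled version only ever holds one block of scores. Applied to every query, that is what lets the extra working memory grow linearly with sequence length instead of quadratically.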

Hardware-Software Co-Design

FA4 was built specifically for Blackwell's quirks. The architecture presents an asymmetric scaling problem: compute power roughly doubles while memory bandwidth doesn't keep pace. Traditional approaches leave tensor cores sitting idle while waiting for data.
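One common way to reason about this kind of imbalance is the roofline model, sketched below in Python. The article does not use this model or give these figures: the throughput and bandwidth numbers are hypothetical placeholders chosen to show the effect, not Blackwell specifications, and the function names are illustrative.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

def ridge_point(peak_tflops, bandwidth_tb_s):
    """FLOPs per byte a kernel needs before compute becomes the limit."""
    return peak_tflops / bandwidth_tb_s

# HYPOTHETICAL generations: compute doubles, bandwidth grows only 1.5x.
old_peak, old_bw = 1000.0, 4.0   # TFLOPS, TB/s (made-up values)
new_peak, new_bw = 2000.0, 6.0   # TFLOPS, TB/s (made-up values)

kernel_intensity = 250.0         # FLOPs per byte moved (made-up value)

old_rate = attainable_tflops(old_peak, old_bw, kernel_intensity)
new_rate = attainable_tflops(new_peak, new_bw, kernel_intensity)

print(f"Old ridge point: {ridge_point(old_peak, old_bw):.0f} FLOPs/byte")
print(f"New ridge point: {ridge_point(new_peak, new_bw):.0f} FLOPs/byte")
print(f"Old utilization: {old_rate / old_peak:.0%}")
print(f"New utilization: {new_rate / new_peak:.0%}")
```

With these made-up numbers, a kernel that fully used the older chip becomes memory-bound on the newer one. A kernel has to do more work per byte moved just to stay compute-bound when compute grows faster than bandwidth.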

The solution leverages Blackwell's dedicated Tensor Memory (TMEM), which provides 256 KB of on-chip memory per streaming multiprocessor. By storing intermediate calculations directly in TMEM instead of shared memory, FA4 sidesteps the bandwidth bottleneck that would otherwise throttle the faster compute units.

Larger tile sizes (up to 128×128) and deeper pipelines keep the hardware busy. The backward pass, typically the slower half of training, benefits from bypassing register accumulation entirely.
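The quoted tile size and TMEM capacity can be related with simple arithmetic. The Python sketch below makes two assumptions: intermediates are stored as 4-byte (fp32) values, and 256 KB means 256 × 1024 bytes. The article states neither, and the variable names are illustrative.

```python
TMEM_BYTES = 256 * 1024            # 256 KB per SM, as quoted in the article
TILE_ROWS, TILE_COLS = 128, 128    # largest tile size quoted in the article
BYTES_PER_ELEMENT = 4              # ASSUMPTION: fp32 intermediates

# Footprint of one tile of intermediate values.
tile_bytes = TILE_ROWS * TILE_COLS * BYTES_PER_ELEMENT

# How many such tiles fit in one SM's TMEM at once.
tiles_that_fit = TMEM_BYTES // tile_bytes

print(f"One tile: {tile_bytes // 1024} KiB; tiles per TMEM: {tiles_that_fit}")
# prints "One tile: 64 KiB; tiles per TMEM: 4"
```

Under those assumptions, a handful of full-size tiles can stay on chip at the same time, which is consistent with the deeper pipelining the article describes.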

Production Integration

Major inference frameworks including SGLang and vLLM already support FA4 prefill operations. NVIDIA has incorporated these techniques into cuDNN 9.14, making the optimizations accessible to developers without custom kernel work.

For AI companies burning through compute budgets, the efficiency gains translate directly into cost savings. A 3x+ speedup on training passes means either faster iteration cycles or the ability to train larger models within existing infrastructure constraints.
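How much of a kernel-level speedup shows up end to end depends on what share of total runtime attention accounts for. The Python sketch below applies Amdahl's law to illustrate this. The attention shares are hypothetical values chosen for illustration, not measurements from the article, and the function name is an assumption.

```python
def end_to_end_speedup(attention_fraction, kernel_speedup):
    """Amdahl's law: only the attention share of runtime gets faster."""
    remaining = (1.0 - attention_fraction) + attention_fraction / kernel_speedup
    return 1.0 / remaining

KERNEL_SPEEDUP = 3.0  # the "3x+" figure from the article, taken as exactly 3x

# HYPOTHETICAL shares of total runtime spent in attention.
for fraction in (0.25, 0.50, 0.75):
    overall = end_to_end_speedup(fraction, KERNEL_SPEEDUP)
    print(f"attention share {fraction:.0%}: overall speedup {overall:.2f}x")
```

The larger attention's share of runtime, which tends to grow with context length, the closer the overall gain gets to the kernel-level figure.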

The broader trend here: as transformer models grow, algorithmic efficiency at the kernel level becomes as important as raw hardware capability. FlashAttention-4 represents the current frontier of that optimization work.

