In short
- Xiaomi and inference companion TileRT have damaged 1,000 tokens per second on a 1-trillion-parameter mannequin, a primary at that scale, utilizing a normal 8-GPU commodity node—not customized chips.
- The velocity comes from FP4 quantization on the mannequin’s professional layers and DFlash speculative decoding, which proposes a full block of tokens in a single move as a substitute of separately.
- A restricted API trial opens June 9 by way of June 23, priced at 3× customary MiMo charges for roughly 10× the era velocity.
Most individuals know Xiaomi because the Chinese language cellphone model. The one which makes low-cost electrical scooters and air purifiers. Not precisely the corporate you’d count on to interrupt a significant AI inference velocity file on a Monday morning.
And but. Xiaomi simply launched MiMo-V2.5-Professional-UltraSpeed, a serving mode for its trillion-parameter flagship that hits over 1,000 tokens per second—peaking close to 1,200 in demos.
Parameters are the interior numerical weights that outline how a mannequin thinks—the extra you have got, the extra advanced the patterns it will possibly acknowledge. Tokens are the chunks of textual content the mannequin reads and writes, roughly three-quarters of a phrase every on common.
Xiaomi did it on a single 8-GPU commodity node. Commonplace {hardware}, no customized chips. That adjustments the calculus for who can truly deploy this sort of velocity in manufacturing.
To place that quantity in human phrases: per Synthetic Evaluation, GPT-5.5—what most ChatGPT customers are literally speaking to—sits at 68. Claude Opus 4.6 lands round 71 with the decrease finish mannequin, Haiku, touching 98 tokens per second. Gemini Flash hits 192 tokens per second. MiMo-V2.5-Professional-UltraSpeed does 1,000, on a mannequin that matches Opus on coding benchmarks.

Cerebras and Groq constructed whole companies round this drawback. Cerebras designed a wafer-scale chip the scale of a dinner plate, packing 44GB of on-chip reminiscence to remove the bandwidth bottleneck that slows down GPU inference. It hit 969 tokens per second on Meta’s Llama 3.1 405B—spectacular, however that is a 405-billion-parameter mannequin, lower than half the scale of MiMo-V2.5-Professional. Groq’s customized Language Processing Unit structure tops out round 300–750 tokens per second relying on mannequin.
Neither runs on {hardware} you possibly can lease from AWS tonight.
Xiaomi did it on commodity GPUs by way of software program alone—a mix of model-level tips and a purpose-built inference engine known as TileRT.
What’s truly occurring below the hood
Two strategies carry the velocity. The primary method is known as FP4 Quantization: as a substitute of working the mannequin at full 8-bit or 16-bit numerical precision, Xiaomi shrinks the professional layers—which make up many of the 1 trillion parameters—all the way down to 4-bit. Reminiscence footprint drops, bandwidth stress drops, velocity goes up. The catch is often a small high quality degradation. Xiaomi’s repair is surgical: solely the professional layers get compressed, every little thing else stays at full precision. With this strategy, high quality loss is described as near-zero.
The second is DFlash speculative decoding. Regular speculative decoding has a small draft mannequin guess the subsequent few tokens, then the massive mannequin verifies them in parallel. DFlash skips the sequential drafting totally—it fills a complete block of masked positions in a single ahead move. In coding duties, the massive mannequin accepts a median of 6.3 out of 8 proposed tokens per verification spherical. That is six tokens confirmed in a single step as a substitute of 1.
TileRT ties it collectively. It retains your complete compute pipeline constantly resident contained in the GPU—no per-operator launch overhead, no execution gaps.
Xiaomi calls this strategy “excessive model-system codesign,” and the phrase is correct: Neither method alone will get to 1,000 tokens per second, however the synergy amongst all approaches does.
MiMo-V2.5-Professional is a frontier-level mannequin. We coated the V2.5 Professional launch in April—it matches Claude Opus on most coding benchmarks and runs at roughly $0.43 enter / $0.87 output per million tokens. Opus prices $5 enter / $25 output per million tokens.
UltraSpeed accelerates that precise MiMo V2.5 Professional mannequin, not a stripped-down model.
Quick sufficient inference adjustments how you should use a mannequin. You may run dozens of reasoning paths in parallel as a substitute of ready on one reply. Fraud detection, buying and selling sign era, real-time agent loops—all of those have laborious latency constraints that 60 tokens per second cannot meet. At 1,000 tokens per second, they will.
Xiaomi is pricing the velocity at 3 occasions the usual MiMo-V2.5-Professional charge for roughly 10 occasions the output. The API trial runs June 9–23, application-based, with precedence given to enterprise {and professional} builders. The FP4-DFlash checkpoint is already open-sourced on Hugging Face for neighborhood testing.
Each day Debrief E-newsletter
Begin every single day with the highest information tales proper now, plus authentic options, a podcast, movies and extra.
