Caroline Bishop
Feb 04, 2026 16:53
Mistral releases Voxtral Transcribe 2 with real-time streaming at $0.003/min, undercutting rivals whereas matching accuracy. Open weights underneath Apache 2.0.
Mistral AI dropped its second-generation speech-to-text fashions on February 4, 2026, with Voxtral Transcribe 2 delivering what the corporate claims is state-of-the-art transcription at a fraction of competitor pricing. The headline quantity: $0.003 per minute for batch processing, roughly one-fifth the price of ElevenLabs’ Scribe v2.
The discharge consists of two fashions serving completely different use circumstances. Voxtral Mini Transcribe V2 handles batch jobs with speaker diarization and word-level timestamps. Voxtral Realtime targets stay functions with latency configurable all the way down to sub-200 milliseconds—quick sufficient for voice brokers that do not really feel sluggish.
Efficiency Claims Stack Up In opposition to Huge Names
Mistral’s benchmarks present roughly 4% phrase error price on FLEURS, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, Meeting Common, and Deepgram Nova on accuracy. The corporate additionally claims 3x sooner processing than ElevenLabs’ Scribe v2 whereas matching high quality.
At 2.4 seconds delay—appropriate for stay subtitling—Realtime matches the batch mannequin’s accuracy. Drop to 480ms and also you’re 1-2% further phrase error price, which Mistral positions as acceptable for conversational AI functions.
Open Weights Change the Deployment Math
Voxtral Realtime ships underneath Apache 2.0, that means enterprises can deploy on-premise with out API calls. With a 4B parameter footprint, the mannequin runs on edge units—a major consideration for healthcare, finance, and different sectors the place audio information cannot depart inside infrastructure.
Each fashions assist GDPR and HIPAA-compliant deployments, addressing the compliance complications which have slowed enterprise AI adoption.
What’s Really New Since July 2025
When Mistral launched the unique Voxtral in July 2025, speaker diarization was notably absent—the corporate was actively searching for design companions for the function. This launch delivers on that promise, including exact speaker attribution with begin/finish instances. Context biasing is one other addition, letting customers feed as much as 100 domain-specific phrases to enhance accuracy on correct nouns and technical vocabulary.
Language assist expanded to 13 languages together with Chinese language, Hindi, Arabic, Japanese, and Korean. Audio file limits jumped to three hours per request, up from the unique 30-40 minute caps.
Enterprise Implications
The pricing construction creates attention-grabbing dynamics for contact facilities and assembly intelligence platforms presently paying premium charges for transcription APIs. At $0.003/min, transcribing one million minutes of audio runs $3,000—a quantity that makes beforehand cost-prohibitive use circumstances all of a sudden viable.
Voxtral Realtime’s $0.006/min pricing for streaming functions nonetheless undercuts most rivals whereas enabling real-time sentiment evaluation and stay agent help options.
Builders can take a look at each fashions instantly via Mistral Studio’s new audio playground or seize Realtime weights straight from Hugging Face.
Picture supply: Shutterstock

