Zach Anderson
Apr 18, 2026 00:53
Elon Musk’s xAI releases Grok Speech to Text and Text to Speech APIs at $0.10/hour, claiming the lowest error rates across enterprise transcription benchmarks.

Elon Musk’s xAI dropped two standalone audio APIs on April 17, positioning Grok’s speech technology as a direct competitor to ElevenLabs, Deepgram, and AssemblyAI at aggressive price points.
The Grok Speech to Text API runs $0.10 per hour for batch processing and $0.20 per hour for real-time streaming. Text to Speech comes in at $4.20 per million characters. Both leverage the same infrastructure powering Tesla vehicles and Starlink customer support.
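To put those rates in context, here is a rough cost sketch in Python. Only the three prices come from the announcement. The function names are hypothetical, and the code assumes simple linear billing with no minimums, rounding rules, or volume discounts, none of which the article confirms.

```python
# Prices quoted in the article (USD). Everything else is illustrative.
STT_BATCH_PER_HOUR = 0.10
STT_STREAMING_PER_HOUR = 0.20
TTS_PER_MILLION_CHARS = 4.20


def stt_cost(audio_hours: float, streaming: bool = False) -> float:
    """Estimated Speech to Text cost, assuming linear per-hour billing."""
    rate = STT_STREAMING_PER_HOUR if streaming else STT_BATCH_PER_HOUR
    return audio_hours * rate


def tts_cost(characters: int) -> float:
    """Estimated Text to Speech cost, assuming linear per-character billing."""
    return characters / 1_000_000 * TTS_PER_MILLION_CHARS


if __name__ == "__main__":
    # Hypothetical workload: 1,000 hours of recorded calls, 2.5M characters of speech.
    print(f"Batch STT:     ${stt_cost(1000):.2f}")
    print(f"Streaming STT: ${stt_cost(1000, streaming=True):.2f}")
    print(f"TTS:           ${tts_cost(2_500_000):.2f}")
```

Under those assumptions, a thousand hours of batch transcription would cost about $100, and the same audio streamed in real time about $200.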
Benchmark Claims Worth Scrutinizing
xAI’s published word error rates tell an interesting story. On phone call entity recognition (think names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That’s a significant gap if it holds up in production.
The company demonstrated this with a challenging test case: transcribing Welsh and Irish names like “Anghared Llewelyn Bowen” and “Oisin MacGiolla Phadraig” alongside mortgage details. Grok nailed it with zero errors. Competing models stumbled on pronunciations and formatted dates inconsistently.
Video and podcast transcription shows tighter competition: Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing slightly at 3.0% and 3.2% respectively.
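Anyone wanting to check these figures on their own audio can compute word error rate directly. The sketch below uses the standard textbook definition: word-level edit distance divided by the number of reference words. The article does not describe xAI’s scoring methodology, and the entity recognition numbers may be calculated differently, so results from this function are not directly comparable to the published figures. The function name and the lowercase, whitespace-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcript pair, for illustration only.
    truth = "the loan closes on march third"
    guess = "the loan closes in march third"
    print(f"WER: {word_error_rate(truth, guess):.1%}")
```

Real evaluations usually also normalize punctuation and number formatting before scoring, which can move the result by several points.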
Technical Features for Developers
Beyond raw transcription, xAI built in features that enterprise customers actually want: word-level timestamps, speaker diarization across multiple audio channels, and support for 25+ languages with seamless switching.
The Inverse Text Normalization feature automatically converts spoken numbers, dates, and currencies into proper formats. “Four one four five five five one two three four” becomes a phone number. “Six ninety-nine” becomes $6.99. Small detail, but it eliminates post-processing headaches.
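For a sense of what inverse text normalization replaces, here is a toy rule-based version of the phone number case. This is not xAI’s implementation, which the article does not describe. The function name, the ten-digit requirement, and the US-style output format are all assumptions made for illustration.

```python
from typing import Optional

# Spoken digit words mapped to numerals. "oh" is a common spoken form of zero.
DIGIT_WORDS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}


def spoken_digits_to_phone(text: str) -> Optional[str]:
    """Format exactly ten spoken digits as a US-style phone number.

    Returns None when the input is anything other than ten digit words.
    """
    words = text.lower().split()
    if len(words) != 10 or any(word not in DIGIT_WORDS for word in words):
        return None
    digits = "".join(DIGIT_WORDS[word] for word in words)
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


if __name__ == "__main__":
    print(spoken_digits_to_phone("four one four five five five one two three four"))
    print(spoken_digits_to_phone("six ninety-nine"))  # not ten digits, so None
```

Hand-written rules like these break quickly on prices, dates, and mixed phrasing, which is why having the normalization done inside the transcription service saves post-processing work.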
Text to Speech includes inline tags for prosody control: whispers, laughs, sighs, emphasis, pacing adjustments. Developers can inject emotional nuance without wrestling with complex audio markup.
Strategic Context
This launch follows xAI’s acquisition of X Corp in March 2025 and comes as the company expands its infrastructure partnerships. Just two days before the API announcement, reports emerged that xAI plans to supply computing power to Cursor, the AI-powered coding startup.
The Colossus supercomputer, operational since December 2024, provides the backend muscle. xAI appears to be monetizing that capacity across multiple verticals: enterprise AI, developer tools, and now voice APIs.
For developers building voice agents or transcription tools, the pricing undercuts established players significantly. Whether Grok’s accuracy claims survive real-world deployment at scale remains the open question. The documentation and rate limits are available through xAI’s API console for those ready to test it.
Image source: Shutterstock
