Mistral AI Launches Voxtral TTS, an Open Source Speech Model That Rivals ElevenLabs

French AI startup Mistral AI has released Voxtral TTS, an open source text-to-speech model that directly competes with ElevenLabs, Deepgram, and OpenAI in the rapidly growing voice AI market.

A Lightweight Powerhouse

Voxtral TTS is based on Ministral 3B and pairs an autoregressive transformer with flow matching. The model totals roughly 4 billion parameters (about 4.09 billion when the figures below are summed), split across three main components:

A transformer decoder backbone with 3.4 billion parameters
A flow-matching acoustic transformer with 390 million parameters
A neural audio codec with 300 million parameters featuring semantic vector quantization

Its compact size means it can run on consumer hardware: modern laptops, mid-range desktop GPUs, and even some high-end mobile devices.
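
As a rough check on these figures, the following Python sketch sums the three reported component sizes and estimates weight-only memory at a few storage precisions. The bytes-per-parameter values and the function names are illustrative assumptions, not details published by Mistral. Real memory use also includes activations, caches, and runtime overhead, so the results are best read as a lower bound.

```python
# Component sizes as reported in the article, in parameters.
COMPONENTS = {
    "transformer decoder backbone": 3.4e9,
    "flow-matching acoustic transformer": 390e6,
    "neural audio codec": 300e6,
}

# Bytes per parameter for common storage formats.
# Assumption for illustration; the article does not state a precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def total_params(components=COMPONENTS):
    """Sum the parameter counts of all components."""
    return sum(components.values())


def weight_memory_gb(precision, components=COMPONENTS):
    """Approximate weight-only memory in GB (10^9 bytes)."""
    return total_params(components) * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    print(f"total: {total_params() / 1e9:.2f}B parameters")
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_memory_gb(precision):.1f} GB of weights")
```

Under these assumptions the components sum to about 4.09 billion parameters, or roughly 8.2 GB of weights at 16-bit precision.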

Voice Cloning in 3 Seconds

Voxtral TTS's most impressive capability is its ultra-fast voice adaptation. Just 3 seconds of reference audio is enough for the model to capture a speaker's vocal personality, natural pauses, rhythm, intonation, and emotional expressions.

The model also supports zero-shot cross-lingual voice transfer, generating speech in one language using a voice sample from another — for instance, producing naturally French-accented English.

9 Languages Supported

Voxtral TTS currently supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The selection reflects Mistral's focus on European, South Asian, and Arabic-speaking markets.

Performance That Matches Market Leaders

According to human evaluations published by Mistral:

Superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio
Quality parity with ElevenLabs v3, the market leader's premium model
70 ms latency for a typical input (a 10-second voice sample plus 500 characters of text)
Real-time factor of approximately 9.7x
Native audio generation of up to 2 minutes, with smart interleaving for longer content via the API
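
To make the throughput figure concrete, here is a back-of-the-envelope Python sketch. It assumes that a real-time factor of 9.7x means the model produces about 9.7 seconds of audio per second of compute. The article does not define the term, and some sources use the inverse convention, so this reading is an assumption. The function name is hypothetical.

```python
# Figures taken from the article.
RTF_SPEEDUP = 9.7          # assumed: seconds of audio produced per second of compute
MAX_NATIVE_SECONDS = 120   # native generation limit of 2 minutes


def generation_seconds(audio_seconds, speedup=RTF_SPEEDUP):
    """Estimate compute time needed to synthesize a clip of the given length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / speedup


if __name__ == "__main__":
    for clip in (10, 60, MAX_NATIVE_SECONDS):
        print(f"{clip:>4} s of audio -> ~{generation_seconds(clip):.1f} s to generate")
```

Under this reading, a full 2-minute clip would take a little over 12 seconds to generate, while the 70 ms latency figure describes how quickly the first audio arrives rather than how long the whole clip takes.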

Open Source and Accessible

Model weights are available for download on Hugging Face under the Creative Commons BY-NC 4.0 license, which permits non-commercial use only. For commercial use, Voxtral TTS is accessible through the Mistral API at $0.016 per 1,000 characters, as well as on Mistral Studio and Le Chat.
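
For budgeting purposes, the quoted API price translates into a simple per-character calculation. The sketch below is an estimate only: the function name is hypothetical, and it assumes billing is strictly proportional to character count with no minimums or rounding, which the article does not specify.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS = Decimal("0.016")


def synthesis_cost_usd(num_chars):
    """Estimate API cost in USD for synthesizing num_chars characters.

    Assumes strictly proportional billing (an assumption, not documented here).
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return Decimal(num_chars) * PRICE_PER_1K_CHARS / Decimal(1000)


if __name__ == "__main__":
    for chars in (500, 10_000, 1_000_000):
        print(f"{chars:>9,} characters -> ${synthesis_cost_usd(chars):.3f}")
```

At this rate, the 500-character input used in Mistral's latency benchmark would cost under one cent, and one million characters would cost $16.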

A Strategic Move for Mistral

With Voxtral TTS, Mistral significantly expands its offering beyond text-based language models. This launch follows the release of Mistral Small 4 at Nvidia's GTC on March 17, as the French startup continues building out its multimodal model ecosystem.

The AI speech synthesis market is experiencing explosive growth, driven by demand for voice agents in customer service, virtual assistants, and conversational interfaces. By offering an open source model that rivals proprietary solutions, Mistral positions itself as a credible European alternative for enterprises concerned with technological sovereignty.

Source: Mistral AI