Mistral AI Launches Voxtral TTS, an Open-Weight Speech Model That Rivals ElevenLabs

French AI startup Mistral AI has released Voxtral TTS, an open-weight text-to-speech model that competes directly with offerings from ElevenLabs, Deepgram, and OpenAI in the fast-growing voice AI market.
A Lightweight Powerhouse
Voxtral TTS combines an autoregressive transformer with flow matching and is based on Ministral 3B. The model totals roughly 4 billion parameters, split across three main components:
- A transformer decoder backbone with 3.4 billion parameters
- A flow-matching acoustic transformer with 390 million parameters
- A neural audio codec with 300 million parameters featuring semantic vector quantization
Its compact size means it can run on consumer hardware: modern laptops, mid-range desktop GPUs, and even some high-end mobile devices.
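As a quick check on these figures, the sketch below sums the three reported component sizes and estimates how much memory the weights alone would occupy. The bytes-per-parameter values are assumptions for illustration; the article does not say which numeric precision the released weights use, and real memory use also includes activations and caches.

```python
# Component sizes as reported in the article (number of parameters).
COMPONENTS = {
    "transformer_decoder_backbone": 3_400_000_000,
    "flow_matching_acoustic_transformer": 390_000_000,
    "neural_audio_codec": 300_000_000,
}


def total_parameters(components):
    """Sum the parameter counts of all components."""
    return sum(components.values())


def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for the weights only, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


total = total_parameters(COMPONENTS)
print(f"total: {total / 1e9:.2f}B parameters")

# Assumed precisions, not confirmed by the source.
for label, nbytes in [("32-bit", 4), ("16-bit", 2), ("8-bit", 1)]:
    print(f"{label}: ~{weight_memory_gb(total, nbytes):.1f} GB for weights")
```

The components add up to 4.09 billion parameters, which is consistent with the rounded 4 billion figure.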
Voice Cloning in 3 Seconds
Voxtral TTS's most notable capability is fast voice adaptation. Just 3 seconds of reference audio is enough for the model to capture a speaker's vocal personality, natural pauses, rhythm, intonation, and emotional expression.
The model also supports zero-shot cross-lingual voice transfer, generating speech in one language from a voice sample recorded in another. For instance, a French speaker's sample can be used to produce English speech that keeps a natural French accent.
9 Languages Supported
Voxtral TTS currently supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The selection is a strategic one, as Mistral targets European, South Asian, and Arabic-speaking markets.
Performance That Matches Market Leaders
According to human evaluations published by Mistral:
- Superior naturalness compared to ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio
- Quality parity with ElevenLabs v3, the market leader's premium model
- 70 ms latency for a typical input (10-second voice sample plus 500 characters)
- Real-time factor of approximately 9.7x, meaning audio is generated about 9.7 times faster than it plays back
- Native audio generation of up to 2 minutes, with smart interleaving for longer content via the API
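To put the last two figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 9.7x figure means audio is produced 9.7 times faster than playback speed, and it treats the 2-minute native limit as a simple cap per segment. The function names are hypothetical, and how the API actually interleaves longer content is not described in the source.

```python
import math

SPEEDUP = 9.7               # reported real-time factor (assumed: faster than playback)
MAX_NATIVE_SECONDS = 120    # reported native generation limit (2 minutes)


def generation_time_seconds(audio_seconds, speedup=SPEEDUP):
    """Estimated compute time to synthesize `audio_seconds` of speech."""
    return audio_seconds / speedup


def segments_needed(audio_seconds, max_native=MAX_NATIVE_SECONDS):
    """Number of native-length segments required to cover the audio."""
    if audio_seconds <= 0:
        return 0
    return math.ceil(audio_seconds / max_native)


for seconds in (10, 120, 300):
    print(
        f"{seconds:>4} s of audio: "
        f"~{generation_time_seconds(seconds):.1f} s to generate, "
        f"{segments_needed(seconds)} segment(s)"
    )
```

Under these assumptions, a full 2-minute clip would take a little over 12 seconds to synthesize, while the 70 ms figure describes only the delay before the first audio arrives.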
Open Weights and Accessible
Model weights are available for download on Hugging Face under the Creative Commons BY-NC 4.0 license, which permits non-commercial use only. For commercial use, Voxtral TTS is accessible through the Mistral API at $0.016 per 1,000 characters, as well as on Mistral Studio and Le Chat.
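For budgeting purposes, the quoted API rate translates into costs as in the sketch below. It assumes billing is strictly linear in character count, with no minimum charge, rounding, or volume discount; the source states only the per-1,000-character price, and the function name is hypothetical.

```python
from decimal import Decimal

# Price quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_chars):
    """Estimate API cost in USD, assuming purely linear per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_1K_CHARS_USD * Decimal(num_chars) / Decimal(1000)


for chars in (500, 10_000, 1_000_000):
    print(f"{chars:>9,} characters: ${estimate_cost_usd(chars)}")
```

At this rate, one million characters of input text would cost $16.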
A Strategic Move for Mistral
With Voxtral TTS, Mistral significantly expands its offering beyond text-based language models. This launch follows the release of Mistral Small 4 at Nvidia's GTC on March 17, as the French startup continues building out its multimodal model ecosystem.
The AI speech synthesis market is growing rapidly, driven by demand for voice agents in customer service, virtual assistants, and conversational interfaces. By offering an open-weight model that rivals proprietary solutions, Mistral positions itself as a credible European alternative for enterprises concerned with technological sovereignty.
Source: Mistral AI