Mistral supercharges voice AI with new models

Feb 4, 2026

3:00pm UTC

AI assistants are going voice-first, and Mistral AI just launched new models to compete.

On Wednesday, the French AI startup launched Voxtral Transcribe 2, its next-generation family of speech-to-text models that boast state-of-the-art transcription quality, speaker diarization, and timestamps, while maintaining ultra-low latency, according to the company. The models are also small enough to run on-device, offering wins in privacy and cost.

“Voxtral Transcribe 2 proves that state-of-the-art transcription can run locally, without compromising accuracy or speed. For businesses and users who demand privacy and control, this changes everything,” Pierre Stock, VP Science at Mistral AI, told The Deep View.

The launch includes:

  • Voxtral Realtime - A 4-billion-parameter model aimed at live transcription, achieving “state of the art” transcription with 480ms latency across 13 languages. It can be configured down to sub-200ms latency.
  • Voxtral Mini Transcribe V2 - Offers high-quality transcription at a lower cost, with Mistral claiming it achieves “the lowest word error rate, at the lowest price point.”
  • An audio playground in Mistral Studio where users can test the transcription capabilities offered by Voxtral 2.

On the FLEURS benchmark, Voxtral Mini Transcribe V2 performs competitively against Gemini and OpenAI models while posting the lowest diarization error rate.

The models can adjust to speaker accents and jargon across languages, making content accessible to as many people as possible. Real-world enterprise uses include AI-powered customer service and multilingual subtitles. Because the models run on-device, they are well suited to industries that handle sensitive data, such as healthcare and finance. In keeping with its open-source approach, Mistral has released the model weights under the Apache 2.0 license.

“Open-weight models like Voxtral Realtime aren’t just about transparency - they’re about acceleration. By putting this technology in the hands of developers worldwide, we’re not just releasing a tool; we’re unlocking a wave of innovation where low latency is critical,” added Stock.

Our Deeper View

While AI models and applications continue to advance, they must meet people where they are to be truly useful. That's why multimodality is becoming increasingly important, with voice interfaces leading the charge. Voice is a natural and intuitive way to interact with AI, one we've already become comfortable with through assistants like Siri and Alexa and, more recently, ChatGPT Voice. Audio AI has already emerged as a key AI trend for 2026, and Mistral's new models are another sign that audio will be an important aspect of AI progress this year.