Transform raw audio into refined intelligence. With native Q&A, real-time insights, and multilingual understanding, Voxtral turns every sound into an opportunity.
Trusted by 50K+ developers to transform 10M+ minutes of audio into actionable intelligence daily
Upload your audio files and transform them into transcriptions, summaries, and actionable insights
Supported: MP3, WAV, M4A, FLAC, OGG (Max 50MB)
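Before sending a file, a client can check it against the limits listed above. The following is a minimal sketch, not part of any Voxtral SDK: the helper name `check_upload` is hypothetical, and it assumes "50MB" means 50 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits copied from the upload text above; everything else is illustrative.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50MB" read as 50 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

This checks only the file extension and size. It does not inspect the audio container itself, so a mislabeled file would still pass.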
Transform text into natural speech, clone any voice with seconds of audio, or apply real-time voice effects—all with a single API call
Experience the real-time speech transcription capabilities of Voxtral with our interactive demonstration
Native French Speaker • 15s • French
French Speaker • 16s • English (French accent)
Person on Street • 5s • English
Business Professional • 14s • Hindi-English
A single developer-friendly API for a complete voice technology stack
Neural TTS up to 48kHz delivering professional-grade audio output
Clone any voice with remarkable accuracy from just seconds of audio
Live vocal effects designed for gaming and streaming applications
Natural-sounding voices in multiple languages for global audiences
Optimized streaming technology for real-time applications
Flexible SDKs and competitive per-character pricing for easy integration
Voxtral delivers the complete voice technology stack your application needs, letting you focus on innovation instead of managing complex services
State-of-the-art speech intelligence with native semantic understanding, available in 24B and 3B variants under the Apache 2.0 license
With a 32k token context length, Voxtral handles audio up to 30 minutes long for transcription, or up to 40 minutes for understanding. That covers meetings, podcasts, and extended conversations.
Ask questions directly about audio content or generate structured summaries without chaining separate ASR and language models. Get instant insights from voice data.
Automatic language detection with state-of-the-art performance in English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, and more. Serve global audiences with one system.
Enable direct triggering of backend functions, workflows, or API calls based on spoken user intents. Turn voice interactions into actionable commands without intermediate parsing.
Choose between Voxtral (24B) for production-scale applications or Voxtral Mini (3B) for edge deployments. Both are available under the Apache 2.0 license for maximum flexibility.
Starting at $0.001 per minute—less than half the price of comparable APIs. Outperforms Whisper large-v3 and matches ElevenLabs Scribe quality at a fraction of the cost.
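The duration limits and the starting price quoted above lend themselves to a quick planning calculation. This is a rough sketch under stated assumptions: the function names are hypothetical, the price is the "starting at" figure only, and actual billing granularity (for example, rounding to whole minutes) is not specified here.

```python
from decimal import Decimal

# Figures taken from the feature list above; actual plan pricing may differ.
PRICE_PER_MINUTE_USD = Decimal("0.001")
MAX_MINUTES = {"transcription": 30, "understanding": 40}


def fits_in_one_request(duration_minutes: float, task: str) -> bool:
    """True if the audio is within the stated per-request limit for the task."""
    return duration_minutes <= MAX_MINUTES[task]


def estimate_cost_usd(duration_minutes: float) -> Decimal:
    """Linear estimate at the starting price, with no rounding applied."""
    return PRICE_PER_MINUTE_USD * Decimal(str(duration_minutes))
```

For example, a 30-minute recording fits the transcription limit and comes to an estimated $0.03 at the starting price, while a 35-minute recording fits only the understanding limit.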
Voxtral builds on Mistral's cutting-edge AI models, delivering state-of-the-art accuracy in both speech recognition and text understanding
Average Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks
Comprehensive text performance metrics showcasing Voxtral's superior language understanding capabilities
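For readers unfamiliar with the speech metric named above, Word Error Rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the evaluation code behind the reported numbers. It applies no text normalization (casing, punctuation), which benchmark pipelines typically do before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Lower is better: an identical transcript scores 0.0, and one wrong word out of four scores 0.25.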
Scale your voice infrastructure with production-grade deployment options, custom fine-tuning, and dedicated support
Deploy Voxtral entirely within your infrastructure for maximum security and compliance. Ideal for regulated industries with strict data privacy requirements.
Work with Mistral AI's applied team to adapt Voxtral to your specialized context and improve accuracy for your specific use case.
Coming soon: Speaker identification, emotion detection, advanced diarization, and extended context windows for even more powerful voice experiences.
Pay only for what you use. No hidden fees, no minimum commitments. Scale from prototype to production with confidence.
Perfect for testing and prototyping
Scale your voice applications
For large-scale deployments
All plans include our state-of-the-art models, multilingual support, and developer-friendly APIs. No setup fees, no hidden costs.
Everything you need to know about Voxtral's voice generation API
How fast can Voxtral generate speech from text?
Voxtral generates speech in real time with ultra-low latency. Most text-to-speech requests complete in under 200ms, which suits interactive applications and real-time voice experiences.
What's required to clone a voice with Voxtral?
Voice cloning with Voxtral requires just 10-30 seconds of high-quality audio from the target voice. The audio should be clear, without background noise, and capture the natural speaking style of the person. We support MP3, WAV, and M4A formats.
Which programming languages have Voxtral SDKs?
Voxtral provides official SDKs for Python, JavaScript/TypeScript, Java, Go, and Ruby. We also offer a RESTful API that can be integrated with any programming language. All SDKs are open-source and available on GitHub.
Can I use Voxtral for commercial projects?
Yes! All Voxtral plans, including the free tier, allow commercial use. You own all content generated through our API. We recommend reviewing our Terms of Service for specific use cases and making sure you have the necessary rights before cloning a voice.
How does Voxtral compare to ElevenLabs or Play.ht?
Voxtral offers superior multilingual support, deeper audio understanding with built-in Q&A capabilities, and significantly lower pricing (starting at $0.001/min). Our models are powered by Mistral AI, providing state-of-the-art accuracy with both 24B and 3B model variants.
Is there a free tier or trial available?
Yes! Our free tier includes 5 hours of audio processing and 1,000 voice generations per month. No credit card required. This gives you full access to test all features including transcription, voice generation, and basic voice cloning.
Can Voxtral handle real-time voice conversations?
Absolutely. Voxtral is optimized for real-time applications with sub-200ms latency. Our streaming API supports bidirectional audio, making it ideal for voice assistants, gaming, and live translation applications.
What audio formats does Voxtral support?
Voxtral supports all major audio formats including MP3, WAV, M4A, FLAC, OGG, and AAC. We can output in various bitrates and sample rates up to 48kHz for studio-quality audio. The API automatically handles format conversion.