Voxtral

VOICE AI THAT UNDERSTANDS, NOT JUST HEARS

Listen Deeper.
Build Smarter.

Transform raw audio into refined intelligence. With native Q&A, real-time insights, and multilingual understanding, Voxtral turns every sound into an opportunity.

Trusted by 50K+ developers to transform 10M+ minutes of audio into actionable intelligence daily

Intelligent Audio Processing

Upload your audio files and transform them into transcriptions, summaries, and actionable insights

Audio Processor

Upload your audio file and let our AI provide transcription, analysis, and insights


Supported: MP3, WAV, M4A, FLAC, OGG (Max 50MB)
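Before uploading, a client can check a file against the limits listed above. The following is a minimal sketch, not part of any official SDK: the helper name `check_upload` is hypothetical, and it assumes "50MB" means 50 MiB.

```python
from pathlib import Path

# Formats and size limit as listed on the upload form.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50MB" is read as 50 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Calling `check_upload("meeting.MP3", 1024)` returns an empty list, because the extension check is case-insensitive and the file is well under the size limit.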


TRY VOXTRAL

Voice Generation Playground

Transform text into natural speech, clone any voice with seconds of audio, or apply real-time voice effects—all with a single API call

Voice Generator

Enter text or upload audio to generate, clone, or transform voices instantly


Live Voice-to-Text Demo

Experience the real-time speech transcription capabilities of Voxtral with our interactive demonstration

Select Audio Example

Choose from our collection of demo audio files

French

Native French Speaker • 15s • French

French man speaking English

French Speaker • 16s • English (French accent)

Noisy street

Person on Street • 5s • English

Hindi mixed with English

Business Professional • 14s • Hindi-English


GIVE YOUR IDEAS A VOXTRAL VOICE

Why Choose Voxtral?

A single developer-friendly API for a complete voice technology stack

Studio-Grade Quality

Neural TTS at sample rates up to 48 kHz, delivering professional-grade audio output

Instant Voice Cloning

Clone any voice with remarkable accuracy from just seconds of audio

Real-Time Voice Effects

Live vocal effects designed for gaming and streaming applications

Multilingual Support

Natural-sounding voices in multiple languages for global audiences

Ultra-Low Latency

Optimized streaming technology for real-time applications

Developer Friendly

Flexible SDKs and competitive per-character pricing for easy integration

From interactive games to AI assistants, audiobooks to voice dubbing

Voxtral delivers the complete voice technology stack your application needs, letting you focus on innovation instead of managing complex services

FRONTIER SPEECH UNDERSTANDING

Beyond Simple Transcription

State-of-the-art speech intelligence with native semantic understanding, available in 24B and 3B variants under the Apache 2.0 license

Long-form Context Processing

With a 32k-token context length, Voxtral handles audio up to 30 minutes long for transcription, or up to 40 minutes for understanding, which suits meetings, podcasts, and extended conversations.
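Recordings longer than the 30-minute transcription limit need to be split before processing. One possible approach, sketched below, is to plan consecutive time windows that each fit within the limit. The function `plan_chunks` is a hypothetical helper, and cutting at fixed boundaries is a simplifying assumption: a real pipeline would likely cut at silences so that words are not split.

```python
def plan_chunks(duration_s: float, max_chunk_s: float = 30 * 60) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows of at most max_chunk_s seconds."""
    if duration_s < 0 or max_chunk_s <= 0:
        raise ValueError("duration must be non-negative and chunk size positive")
    duration_s = float(duration_s)
    chunks = []
    start = 0.0
    while start < duration_s:
        # The last window is shortened so it ends exactly at the recording's end.
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks
```

For a 75-minute recording, `plan_chunks(75 * 60)` yields three windows: two of 30 minutes and a final one of 15 minutes.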

Built-in Q&A and Summarization

Ask questions directly about audio content or generate structured summaries without chaining separate ASR and language models. Get instant insights from voice data.

Natively Multilingual

Automatic language detection with state-of-the-art performance in English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian, and more. Serve global audiences with one system.

Function-calling from Voice

Enable direct triggering of backend functions, workflows, or API calls based on spoken user intents. Turn voice interactions into actionable commands without intermediate parsing.
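On the application side, voice-triggered function calling usually ends in a dispatch step: the model's structured output names a function and its arguments, and your code routes it to the matching backend handler. The sketch below illustrates only that step. The JSON shape (`name`, `arguments`), the handlers `create_ticket` and `set_reminder`, and the `dispatch` helper are all hypothetical and are not taken from the Voxtral API.

```python
import json
from typing import Callable


# Hypothetical backend handlers that a spoken request might trigger.
def create_ticket(subject: str) -> str:
    return f"ticket created: {subject}"


def set_reminder(minutes: int) -> str:
    return f"reminder set for {minutes} minutes"


# Only functions registered here can be invoked from model output.
REGISTRY: dict[str, Callable[..., str]] = {
    "create_ticket": create_ticket,
    "set_reminder": set_reminder,
}


def dispatch(call_json: str) -> str:
    """Route an assumed {"name": ..., "arguments": {...}} payload to its handler."""
    call = json.loads(call_json)
    name = call["name"]
    if name not in REGISTRY:
        raise KeyError(f"unknown function: {name}")
    return REGISTRY[name](**call.get("arguments", {}))
```

Keeping an explicit registry means that an unexpected function name in the model output raises an error instead of reaching arbitrary code.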

Dual Model Variants

Choose between Voxtral (24B) for production-scale applications or Voxtral Mini (3B) for edge deployments. Both are available under the Apache 2.0 license for maximum flexibility.

Unbeatable Value

Starting at $0.001 per minute—less than half the price of comparable APIs. Outperforms Whisper large-v3 and matches ElevenLabs Scribe quality at a fraction of the cost.
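To see what the starting price means for a given workload, the arithmetic is simply minutes of audio multiplied by the per-minute rate. The sketch below uses the $0.001 rate quoted above. The helper `estimate_cost_usd` is hypothetical, and it ignores any volume discounts or plan-specific terms.

```python
from decimal import Decimal

PRICE_PER_MINUTE_USD = Decimal("0.001")  # starting rate quoted above


def estimate_cost_usd(minutes: int) -> Decimal:
    """Estimate the cost of processing a number of audio minutes at the starting rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Decimal keeps the result exact; binary floats would introduce rounding noise.
    return PRICE_PER_MINUTE_USD * minutes
```

At this rate, 10,000 minutes of audio comes to $10, and a single 40-minute recording comes to $0.04.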

Industry-Leading Performance

Voxtral builds on Mistral's cutting-edge AI models, delivering state-of-the-art accuracy in both speech recognition and text understanding

Audio Performance

Average Word Error Rate (WER) across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech benchmarks

Transcription Accuracy: State-of-the-art

Supported Languages: 8+ Languages

Max Audio Length: 40 minutes

Text Understanding

Comprehensive text performance metrics showcasing Voxtral's superior language understanding capabilities

Model Size: 24B Parameters

Context Length: 32k Tokens

Base Model: Mistral Small 3.1

Transcription: 30 min

Understanding: 40 min

Languages

Q&A: Built-in

ENTERPRISE READY

Advanced Enterprise Features

Scale your voice infrastructure with production-grade deployment options, custom fine-tuning, and dedicated support

Private Deployment

Deploy Voxtral entirely within your infrastructure for maximum security and compliance. Ideal for regulated industries with strict data privacy requirements.

• Multi-GPU and multi-node deployment
• Quantized builds for cost efficiency
• Production-scale inference optimization

Domain-Specific Fine-Tuning

Work with Mistral AI's applied team to adapt Voxtral to your specialized context and improve accuracy for your specific use case.

• Legal, medical, and technical domains
• Customer support optimization
• Internal knowledge base integration

Coming soon: Speaker identification, emotion detection, advanced diarization, and extended context windows for even more powerful voice experiences.

TRANSPARENT PRICING

Simple, Scalable Pricing

Pay only for what you use. No hidden fees, no minimum commitments. Scale from prototype to production with confidence.

Free

$0/month

Perfect for testing and prototyping

5 hours of audio processing

1,000 voice generations

✓ Basic transcription
✓ Standard voice models
✓ API access
✓ Community support

Pro

$0.001/min

Scale your voice applications

Pay per minute of audio processed

Unlimited voice generations

✓ Advanced transcription & Q&A
✓ Voice cloning & effects
✓ Real-time processing
✓ Priority support
✓ 99.9% uptime SLA

Enterprise

Custom

For large-scale deployments

Volume discounts available

Dedicated infrastructure

✓ Private cloud deployment
✓ Custom model fine-tuning
✓ Advanced analytics
✓ 24/7 dedicated support
✓ Custom SLA

All plans include our state-of-the-art models, multilingual support, and developer-friendly APIs. No setup fees, no hidden costs.

✓ 30-day money-back guarantee

✓ Cancel anytime

✓ Volume discounts available

GET STARTED TODAY

Ready to transform audio into intelligence? Build smarter voice experiences with Voxtral's AI-powered understanding.

Frequently Asked Questions

Everything you need to know about Voxtral's voice generation API

How fast can Voxtral generate speech from text?

Voxtral generates speech in real time with ultra-low latency. Most text-to-speech requests complete in under 200 ms, making it well suited to interactive applications and real-time voice experiences.

What's required to clone a voice with Voxtral?

Voice cloning with Voxtral requires just 10 to 30 seconds of high-quality audio from the target voice. The audio should be clear, free of background noise, and representative of the person's natural speaking style. We support MP3, WAV, and M4A formats.

Which programming languages have Voxtral SDKs?

Voxtral provides official SDKs for Python, JavaScript/TypeScript, Java, Go, and Ruby. We also offer a RESTful API that can be integrated with any programming language. All SDKs are open-source and available on GitHub.

Can I use Voxtral for commercial projects?

Yes! All Voxtral plans, including the free tier, allow commercial use. You own all content generated through our API. We recommend reviewing our Terms of Service for specific use cases and ensuring you have necessary rights for voice cloning.

How does Voxtral compare to ElevenLabs or Play.ht?

Voxtral offers superior multilingual support, deeper audio understanding with built-in Q&A capabilities, and significantly lower pricing (starting at $0.001/min). Our models are powered by Mistral AI, providing state-of-the-art accuracy with both 24B and 3B model variants.

Is there a free tier or trial available?

Yes! Our free tier includes 5 hours of audio processing and 1,000 voice generations per month. No credit card required. This gives you full access to test all features including transcription, voice generation, and basic voice cloning.

Can Voxtral handle real-time voice conversations?

Absolutely. Voxtral is optimized for real-time applications with sub-200 ms latency. Our streaming API supports bidirectional audio, making it well suited to voice assistants, gaming, and live translation applications.

What audio formats does Voxtral support?

Voxtral supports all major audio formats including MP3, WAV, M4A, FLAC, OGG, and AAC. We can output in various bitrates and sample rates up to 48kHz for studio-quality audio. The API automatically handles format conversion.
