Audio & Speech
Text-to-Speech (TTS) and Speech-to-Text (STT) models.
| Rank | Model | Price | Summary |
|---|---|---|---|
| 1 | ElevenLabs v3 | Usage Based | The Quality Standard. v3 introduces 'Audio Tags' (e.g., [whisper], [laugh]), allowing for directorial control over emotion. Its new 'Pulse' model supports native multi-speaker generation without stitching audio files. |
| 2 | Cartesia Sonic 2 | Usage Based | The Speed King. Built on State Space Models (SSMs) rather than Transformers, it achieves 40 ms latency. In blind A/B testing, it is consistently rated more 'conversational' than ElevenLabs for real-time agents. |
| 3 | OpenAI Realtime API | Usage Based | The Native Speaker. It bypasses text entirely (Speech-to-Speech), allowing for 'barge-in' interruptions and non-verbal cues (breaths, uh-huhs) that text-based pipelines miss. |
| 4 | Fish Speech 1.5 | Open Source | The Open Source Leader. Uses a 'Dual Auto-Regressive' architecture to clone voices with just 10 seconds of audio. It creates the most robust multilingual clones, preserving accents better than paid alternatives. |
| 5 | Kokoro 82M | Open Weights | The Efficiency Miracle. An open-weight model with only 82M parameters. It runs faster than real time on a standard CPU while delivering quality that rivals 3B+ parameter models. Perfect for local devices. |
| 6 | PlayHT Turbo 2.0 | Subscription | The Podcaster. Famous for its 'Parrot' mode, which mimics the exact intonation of a reference file. It is the preferred choice for long-form content generation where consistency over 10+ minutes is key. |
| 7 | Deepgram Aura | Usage Based | The Enterprise Voice. Optimized strictly for high-throughput call centers. While less expressive than ElevenLabs, it is unbreakable at scale and pairs perfectly with Deepgram's STT for sub-second loops. |
| 8 | Kyutai Moshi | Open Source | The End-to-End Open Option. A full speech-text-speech model that runs locally. It excels at handling 'overlapping speech' and interruptions, making it the best open-source foundation for conversational assistants. |
| 9 | LMNT | Usage Based | The Gaming Voice. Designed specifically for interactive media. Its SDK allows developers to modify prosody (speed/pitch) in real time based on game state (e.g., character is running vs walking). |
| 10 | CosyVoice 2 (Alibaba) | Open Source | The Streaming Specialist. Capable of 'Zero-Shot' cloning with varying emotional control. It is widely used in the Asian market for its superior handling of tonal languages and mixed-language (code-switching) speech. |
Just the Highlights
ElevenLabs v3
The Quality Standard. v3 introduces 'Audio Tags' (e.g., [whisper], [laugh]), allowing for directorial control over emotion. Its new 'Pulse' model supports native multi-speaker generation without stitching audio files.
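To make the idea of inline audio tags concrete, here is a minimal sketch that splits a script into tagged segments before it would be sent to a TTS engine. It does not call ElevenLabs; the function name `split_audio_tags` and the tag grammar (lowercase words in square brackets) are assumptions for illustration, based only on the `[whisper]` and `[laugh]` examples above.

```python
import re
from typing import List, Optional, Tuple

# Assumed tag grammar: lowercase words in square brackets, e.g. [whisper], [laugh].
TAG_PATTERN = re.compile(r"\[([a-z][a-z _-]*)\]")


def split_audio_tags(script: str) -> List[Tuple[Optional[str], str]]:
    """Split a script into (tag, text) pairs.

    Text before the first tag is paired with None. A tag with no following
    text (such as a trailing [laugh]) is kept with an empty string.
    """
    segments: List[Tuple[Optional[str], str]] = []
    current_tag: Optional[str] = None
    position = 0
    for match in TAG_PATTERN.finditer(script):
        text = script[position:match.start()].strip()
        # Emit the text collected so far under the tag that preceded it.
        if text or current_tag is not None:
            segments.append((current_tag, text))
        current_tag = match.group(1)
        position = match.end()
    tail = script[position:].strip()
    if tail or current_tag is not None:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    for tag, text in split_audio_tags("Hello. [whisper] Come closer. [laugh]"):
        print(tag, repr(text))
```

A pre-pass like this could be used to check a script for unsupported tags before synthesis; which tags a given engine actually accepts has to come from its own documentation.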
Cartesia Sonic 2
The Speed King. Built on State Space Models (SSMs) rather than Transformers, it achieves 40 ms latency. In blind A/B testing, it is consistently rated more 'conversational' than ElevenLabs for real-time agents.
OpenAI Realtime API
The Native Speaker. It bypasses text entirely (Speech-to-Speech), allowing for 'barge-in' interruptions and non-verbal cues (breaths, uh-huhs) that text-based pipelines miss.
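'Barge-in' means the user can start talking while the agent is still speaking, and the agent stops. The sketch below shows only that control logic as a tiny state holder. It is not the OpenAI Realtime API: the class `TurnManager`, its method names, and the assumption that some voice-activity detector supplies an `is_speech` flag per audio frame are all hypothetical.

```python
class TurnManager:
    """Tracks whether the agent is speaking and handles user interruptions."""

    def __init__(self) -> None:
        self.agent_speaking = False
        self.interruptions = 0

    def start_agent_speech(self) -> None:
        # Called when the agent begins playing a response.
        self.agent_speaking = True

    def finish_agent_speech(self) -> None:
        # Called when the response finishes playing without interruption.
        self.agent_speaking = False

    def on_user_frame(self, is_speech: bool) -> bool:
        """Process one frame of user audio.

        Returns True if this frame triggered a barge-in, in which case the
        caller should stop playback and start listening.
        """
        if is_speech and self.agent_speaking:
            self.agent_speaking = False
            self.interruptions += 1
            return True
        return False


if __name__ == "__main__":
    turns = TurnManager()
    turns.start_agent_speech()
    print(turns.on_user_frame(is_speech=False))  # silence: no barge-in
    print(turns.on_user_frame(is_speech=True))   # user talks over the agent
```

In a real system the hard parts are the voice-activity detection and cancelling audio that has already been buffered; this sketch deliberately leaves both out.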
Fish Speech 1.5
The Open Source Leader. Uses a 'Dual Auto-Regressive' architecture to clone voices with just 10 seconds of audio. It creates the most robust multilingual clones, preserving accents better than paid alternatives.
Kokoro 82M
The Efficiency Miracle. An open-weight model with only 82M parameters. It runs faster than real-time on a standard CPU while delivering quality that rivals 3B+ parameter models. Perfect for local devices.
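"Faster than real time" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, with values below 1.0 meaning the model generates speech faster than it plays back. The sketch below computes that ratio. The `synthesize` callable is a hypothetical stand-in for whatever engine is being measured (it is not Kokoro's API), and it is assumed to return a sequence of audio samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical engine wrapper
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and return its real-time factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    # Audio duration in seconds is the sample count over the sample rate.
    return real_time_factor(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # 2 seconds of compute for 10 seconds of audio gives an RTF of 0.2.
    print(real_time_factor(2.0, 10.0))
```

A single call is a noisy measurement; averaging over several utterances of different lengths gives a more trustworthy figure for a given CPU.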
PlayHT Turbo 2.0
The Podcaster. Famous for its 'Parrot' mode, which mimics the exact intonation of a reference file. It is the preferred choice for long-form content generation where consistency over 10+ minutes is key.
Deepgram Aura
The Enterprise Voice. Optimized strictly for high-throughput call centers. While less expressive than ElevenLabs, it is unbreakable at scale and pairs perfectly with Deepgram's STT for sub-second loops.
Kyutai Moshi
The End-to-End Open Option. A full speech-text-speech model that runs locally. It excels at handling 'overlapping speech' and interruptions, making it the best open-source foundation for conversational assistants.
LMNT
The Gaming Voice. Designed specifically for interactive media. Its SDK allows developers to modify prosody (speed/pitch) in real time based on game state (e.g., character is running vs walking).
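As a rough illustration of driving prosody from game state, the sketch below maps a character state to speed and pitch settings that a game could pass along with each line of dialogue. This is not the LMNT SDK: the `Prosody` type, the state names, and the numeric values are hypothetical and chosen only to show the shape of the mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    speed: float            # rate multiplier, 1.0 = neutral (assumed convention)
    pitch_semitones: float  # shift relative to the base voice (assumed convention)


NEUTRAL = Prosody(speed=1.0, pitch_semitones=0.0)

# Illustrative values only; real settings would be tuned per voice and per game.
PROSODY_BY_STATE = {
    "walking": NEUTRAL,
    "running": Prosody(speed=1.15, pitch_semitones=1.0),
}


def prosody_for_state(state: str) -> Prosody:
    """Return the prosody for a game state, falling back to neutral for unknown states."""
    return PROSODY_BY_STATE.get(state, NEUTRAL)


if __name__ == "__main__":
    for state in ("walking", "running", "swimming"):
        print(state, prosody_for_state(state))
```

Keeping the mapping in a plain table makes it easy for designers to tune values without touching the code that calls the speech engine.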
CosyVoice 2 (Alibaba)
The Streaming Specialist. Capable of 'Zero-Shot' cloning with varying emotional control. It is widely used in the Asian market for its superior handling of tonal languages and mixed-language (code-switching) speech.