Reviews - TheAIStack.org

Rank	Model	Pricing / License	Summary
1	ElevenLabs v3	Usage Based	The Quality Standard. v3 introduces 'Audio Tags' (e.g., [whisper], [laugh]), allowing for directorial control over emotion. Its new 'Pulse' model supports native multi-speaker generation without stitching audio files.
2	Cartesia Sonic 2	Usage Based	The Speed King. Built on State Space Models (SSMs) rather than Transformers, it achieves a reported 40 ms latency. In blind A/B testing, it is consistently rated more 'conversational' than ElevenLabs for real-time agents.
3	OpenAI Realtime API	Usage Based	The Native Speaker. It bypasses text entirely (Speech-to-Speech), allowing for 'barge-in' interruptions and non-verbal cues (breaths, uh-huhs) that text-based pipelines miss.
4	Fish Speech 1.5	Open Source	The Open Source Leader. Uses a 'Dual Auto-Regressive' architecture to clone voices with just 10 seconds of audio. It creates the most robust multilingual clones, preserving accents better than paid alternatives.
5	Kokoro 82M	Open Weights	The Efficiency Miracle. An open-weight model with only 82M parameters, it runs faster than real time on a standard CPU while delivering quality that rivals 3B+ parameter models. Well suited to on-device use.
6	PlayHT Turbo 2.0	Subscription	The Podcaster. Famous for its 'Parrot' mode, which mimics the exact intonation of a reference file. It is the preferred choice for long-form content generation where consistency over 10+ minutes is key.
7	Deepgram Aura	Usage Based	The Enterprise Voice. Optimized strictly for high-throughput call centers. While less expressive than ElevenLabs, it is built for reliability at scale and pairs with Deepgram's STT for sub-second response loops.
8	Kyutai Moshi	Open Source	The End-to-End Open Option. A full speech-text-speech model that runs locally. It excels at handling 'overlapping speech' and interruptions, making it the best open-source foundation for conversational assistants.
9	LMNT	Usage Based	The Gaming Voice. Designed specifically for interactive media. Its SDK allows developers to modify prosody (speed/pitch) in real time based on game state (e.g., whether a character is running or walking).
10	CosyVoice 2 (Alibaba)	Open Source	The Streaming Specialist. Capable of 'Zero-Shot' cloning with adjustable emotional control. It is widely used in the Asian market for its superior handling of tonal languages and mixed-language (code-switching) speech.
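
Several summaries above lean on speed claims ("40 ms latency", "faster than real time", "sub-second loops"). The sketch below shows one common way such figures can be compared: the real-time factor (synthesis time divided by audio duration, where a value below 1.0 means faster than real time) and a simple latency budget for a text-based voice agent. The function and class names are hypothetical, and every number except the 40 ms figure quoted in the table is an illustrative placeholder, not a measurement of any listed model.

```python
from dataclasses import dataclass


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced.

    A value below 1.0 means the model generates audio faster than it plays.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


@dataclass
class VoiceLoopBudget:
    """Hypothetical per-stage latencies (in ms) for an STT -> LLM -> TTS loop."""

    stt_ms: float
    llm_first_token_ms: float
    tts_first_audio_ms: float
    network_ms: float = 0.0

    def total_ms(self) -> float:
        # Time from end of user speech to the first audio heard in reply.
        return (
            self.stt_ms
            + self.llm_first_token_ms
            + self.tts_first_audio_ms
            + self.network_ms
        )

    def is_sub_second(self) -> bool:
        return self.total_ms() < 1000.0


if __name__ == "__main__":
    # Placeholder timing: 2 s of compute to produce 10 s of speech.
    rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
    print(f"real-time factor: {rtf:.2f}")

    # Placeholder stage latencies; only the 40 ms TTS figure comes from the table.
    budget = VoiceLoopBudget(
        stt_ms=300, llm_first_token_ms=400, tts_first_audio_ms=40, network_ms=100
    )
    print(f"loop latency: {budget.total_ms()} ms, sub-second: {budget.is_sub_second()}")
```

In a budget like this the TTS stage is often a small share of the total, so a faster voice model only helps if the transcription and language-model stages are comparably quick.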
