I'm building a real-time speech translator (STT → LLM translation → TTS) and spent a couple of weeks benchmarking every TTS engine I could find, cloud and local. Running on a MacBook Air M4, 24GB RAM. Some findings were... not what I expected. Sharing everything because I couldn't find this data anywhere when I started.

**The setup**

Pipeline: Deepgram Nova-3 (STT, ~300ms) → Groq Llama 3.3 70B (translation, ~200ms) → TTS → speaker

The TTS component is the bottleneck. STT and LLM together take ~500ms. If TTS adds another second, the conversation feels like a walkie-talkie.

**Local TTS benchmarks (Apple M4, warm, same phrases)**
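First, what "warm" means here: a minimal sketch of the kind of timing harness behind these numbers. The `synthesize` callable and the run counts are placeholders, not the actual benchmark code; the point is that warm-up runs are excluded so one-time model-load cost doesn't pollute short-phrase latencies.

```python
import statistics
import time

def bench_warm(synthesize, phrase, runs=10, warmup=2):
    """Median wall-clock latency (ms) after warm-up runs.

    `synthesize` is a stand-in for any TTS call (Kokoro, Piper, ...).
    Warm-up runs absorb one-time model-load / graph-compile cost,
    which otherwise dominates short-phrase numbers.
    """
    for _ in range(warmup):
        synthesize(phrase)
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        synthesize(phrase)
        timings.append((time.perf_counter() - t0) * 1000)  # ms
    return statistics.median(timings)
```

Median rather than mean, so a single slow outlier doesn't skew short-phrase numbers.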
Piper is the fastest but sounds like a robot from 2015 (and the project was archived in Oct 2025). Kokoro 82M is the sweet spot: A+ quality at 370ms for short chunks. Everything above 200M parameters is basically unusable for real-time on a Mac.

**The quantization surprise (this one hurt)**

Tried to speed up Kokoro on M4:
INT8 is almost 2x slower than fp16 on Apple Silicon. ARM chips are optimized for fp16 ops, so quantization saves RAM but adds type-conversion overhead. Burned a full day on this because nothing in the docs mentions it. CoreML doesn't work either: only 37 of 2,493 model nodes are supported by the CoreML EP. MLX is also no faster for short texts. PyTorch CPU was paradoxically faster than MLX for short phrases (98ms vs 364ms for 6 chars) because of MLX's graph-compilation overhead.

**Cloud TTS: protocol matters more than provider**

This was the biggest shock. Same provider, same model, same text:
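Before the numbers: the gap is about time-to-first-byte, not synthesis speed. A toy, self-contained simulation (no real provider involved, all timings made up) of why a streaming socket beats a sync call:

```python
import time

def sync_tts(text, ms_per_char=10):
    """Sync API: the whole clip is synthesized before anything returns."""
    time.sleep(len(text) * ms_per_char / 1000)
    return b"x" * len(text)

def ws_tts(text, ms_per_char=10, chunk=4):
    """Streaming API: audio chunks are yielded as they are synthesized."""
    for i in range(0, len(text), chunk):
        time.sleep(chunk * ms_per_char / 1000)
        yield b"x" * chunk

def time_to_first_audio(call):
    """Milliseconds until the caller holds playable audio."""
    t0 = time.perf_counter()
    result = call()
    # For a generator, 'first audio' is the first chunk, not the full clip
    first = next(result) if hasattr(result, "__next__") else result
    return (time.perf_counter() - t0) * 1000

text = "Hello there, how are you today?"
sync_ms = time_to_first_audio(lambda: sync_tts(text))
ws_ms = time_to_first_audio(lambda: ws_tts(text))
# ws_ms is a fraction of sync_ms: the first chunk arrives long before
# the full clip is done
```

With any per-character synthesis model, the sync path scales with clip length while the streaming path stays flat, which is exactly what the real measurements show.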
Cartesia WebSocket vs sync = 5.5x difference. If you're benchmarking TTS providers with their sync SDK, you're measuring the wrong thing.

**The cost problem**
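A back-of-envelope way to compare per-hour TTS pricing. The speaking rate and both prices below are placeholders I made up for illustration, not anyone's actual quotes:

```python
def monthly_cost(hours, price_per_million_chars, chars_per_minute=900):
    """Rough monthly TTS bill: speech hours -> characters -> dollars.

    chars_per_minute=900 assumes ~150 wpm at ~6 chars/word; this rate
    and any price you plug in are illustrative placeholders.
    """
    chars = hours * 60 * chars_per_minute
    return chars / 1_000_000 * price_per_million_chars

# Hypothetical: a $110/M-chars provider vs a $12/M-chars one, 1,000 h/mo
gap = monthly_cost(1000, 110) - monthly_cost(1000, 12)
```

At real-world price spreads, gaps on this order of magnitude fall out of the arithmetic quickly, which is why per-character pricing matters so much at volume.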
ElevenLabs is 4-20x more expensive than alternatives with comparable quality. At 1,000 hours/month that's a $5,310 difference.

**What I ended up with**

Deepgram Nova-3 → Groq Llama 3.3 70B → StreamChunker (splits into 2-3 word chunks) → Kokoro 82M

Total latency to first audio: ~870ms. Google Meet S2ST is ~2,000ms. Palabra.ai is ~800ms at $25+/mo. Going open-source soon. The translator runs on Elixir + Rust + Flask.

**TL;DR**
I wrote a longer piece with all 30+ providers, ELO rankings, and detailed per-phrase benchmarks if anyone wants the full data: https://ai.gopubby.com/i-benchmarked-30-voice-ai-engines-and-built-a-real-time-translator-faster-than-google-meet-e6a160def969

Happy to answer questions about any specific provider or setup.
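To preempt the most likely question: the StreamChunker step is conceptually just this. A simplified Python sketch of the idea only, not the actual Elixir code:

```python
def stream_chunker(words, max_words=3):
    """Group a stream of words into small chunks for incremental TTS.

    Sketch of the concept: emit a chunk as soon as max_words words
    arrive, and flush whatever is left when the stream ends. Small
    chunks keep time-to-first-audio low; slightly larger ones give
    the TTS engine enough context for natural prosody.
    """
    buf = []
    for word in words:
        buf.append(word)
        if len(buf) >= max_words:
            yield " ".join(buf)
            buf = []
    if buf:
        yield " ".join(buf)  # flush the trailing partial chunk
```

Feeding it the translated token stream yields 2-3 word chunks that can be handed to Kokoro as soon as they form, instead of waiting for the full sentence.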