I'm building a real-time speech translator (STT → LLM translation → TTS) and spent a couple of weeks benchmarking every TTS engine I could find, cloud and local. Running on a MacBook Air M4, 24GB RAM. Some findings were... not what I expected. Sharing everything because I couldn't find this data anywhere when I started.

**The setup**

Pipeline: Deepgram Nova-3 (STT, ~300ms) → Groq Llama 3.3 70B (translation, ~200ms) → TTS → speaker

The TTS component is the bottleneck. STT and LLM together take ~500ms. If TTS adds another second, the conversation feels like a walkie-talkie.

**Local TTS benchmarks (Apple M4, warm, same phrases)**
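First, what "warm" means here: a minimal sketch of the kind of timing harness behind these numbers. The `synthesize` callable and the run counts are placeholders, not the actual benchmark code; the point is that warm-up runs are excluded so one-time model-load cost doesn't pollute short-phrase latencies.

```python
import statistics
import time

def bench_warm(synthesize, phrase, runs=10, warmup=2):
    """Median wall-clock latency (ms) after warm-up runs.

    `synthesize` is a stand-in for any TTS call (Kokoro, Piper, ...).
    Warm-up runs absorb one-time model-load / graph-compile cost,
    which otherwise dominates short-phrase numbers.
    """
    for _ in range(warmup):
        synthesize(phrase)
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        synthesize(phrase)
        timings.append((time.perf_counter() - t0) * 1000)  # ms
    return statistics.median(timings)
```

Median rather than mean, so a single slow outlier doesn't skew short-phrase numbers.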
Piper is the fastest but sounds like a robot from 2015 (and the project was archived in Oct 2025). Kokoro 82M is the sweet spot: A+ quality at 370ms for short chunks. Everything above 200M parameters is basically unusable for real-time on a Mac.

**The quantization surprise (this one hurt)**

Tried to speed up Kokoro on M4:
INT8 is almost 2x slower than fp16 on Apple Silicon. ARM chips are optimized for fp16 ops, so quantization saves RAM but adds type-conversion overhead. Burned a full day on this because nothing in the docs mentions it. CoreML doesn't work either: only 37 of 2,493 model nodes are supported by the CoreML EP. MLX is also no faster for short texts. PyTorch CPU was paradoxically faster than MLX for short phrases (98ms vs 364ms for 6 chars) because of MLX's graph-compilation overhead.

**Cloud TTS: protocol matters more than provider**

This was the biggest shock. Same provider, same model, same text:
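Before the numbers: the gap is about time-to-first-byte, not synthesis speed. A toy, self-contained simulation (no real provider involved, all timings made up) of why a streaming socket beats a sync call:

```python
import time

def sync_tts(text, ms_per_char=10):
    """Sync API: the whole clip is synthesized before anything returns."""
    time.sleep(len(text) * ms_per_char / 1000)
    return b"x" * len(text)

def ws_tts(text, ms_per_char=10, chunk=4):
    """Streaming API: audio chunks are yielded as they are synthesized."""
    for i in range(0, len(text), chunk):
        time.sleep(chunk * ms_per_char / 1000)
        yield b"x" * chunk

def time_to_first_audio(call):
    """Milliseconds until the caller holds playable audio."""
    t0 = time.perf_counter()
    result = call()
    # For a generator, 'first audio' is the first chunk, not the full clip
    first = next(result) if hasattr(result, "__next__") else result
    return (time.perf_counter() - t0) * 1000

text = "Hello there, how are you today?"
sync_ms = time_to_first_audio(lambda: sync_tts(text))
ws_ms = time_to_first_audio(lambda: ws_tts(text))
# ws_ms is a fraction of sync_ms: the first chunk arrives long before
# the full clip is done
```

With any per-character synthesis model, the sync path scales with clip length while the streaming path stays flat, which is exactly what the real measurements show.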
Cartesia WebSocket vs sync = 5.5x difference. If you're benchmarking TTS providers with their sync SDK, you're measuring the wrong thing.

**The cost problem**
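A back-of-envelope way to compare per-hour TTS pricing. The speaking rate and both prices below are placeholders I made up for illustration, not anyone's actual quotes:

```python
def monthly_cost(hours, price_per_million_chars, chars_per_minute=900):
    """Rough monthly TTS bill: speech hours -> characters -> dollars.

    chars_per_minute=900 assumes ~150 wpm at ~6 chars/word; this rate
    and any price you plug in are illustrative placeholders.
    """
    chars = hours * 60 * chars_per_minute
    return chars / 1_000_000 * price_per_million_chars

# Hypothetical: a $110/M-chars provider vs a $12/M-chars one, 1,000 h/mo
gap = monthly_cost(1000, 110) - monthly_cost(1000, 12)
```

At real-world price spreads, gaps on this order of magnitude fall out of the arithmetic quickly, which is why per-character pricing matters so much at volume.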
ElevenLabs is 4-20x more expensive than alternatives with comparable quality. At 1,000 hours/month that's a $5,310 difference.

**What I ended up with**

Deepgram Nova-3 → Groq Llama 3.3 70B → StreamChunker (splits into 2-3 word chunks) → Kokoro 82M

Total latency to first audio: ~870ms. Google Meet S2ST is ~2,000ms. Palabra.ai is ~800ms at $25+/mo. Going open-source soon. The translator runs on Elixir + Rust + Flask.

**TL;DR**
I wrote a longer piece with all 30+ providers, ELO rankings, and detailed per-phrase benchmarks if anyone wants the full data: https://ai.gopubby.com/i-benchmarked-30-voice-ai-engines-and-built-a-real-time-translator-faster-than-google-meet-e6a160def969

Happy to answer questions about any specific provider or setup.
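To preempt the most likely question: the StreamChunker step is conceptually just this. A simplified Python sketch of the idea only, not the actual Elixir code:

```python
def stream_chunker(words, max_words=3):
    """Group a stream of words into small chunks for incremental TTS.

    Sketch of the concept: emit a chunk as soon as max_words words
    arrive, and flush whatever is left when the stream ends. Small
    chunks keep time-to-first-audio low; slightly larger ones give
    the TTS engine enough context for natural prosody.
    """
    buf = []
    for word in words:
        buf.append(word)
        if len(buf) >= max_words:
            yield " ".join(buf)
            buf = []
    if buf:
        yield " ".join(buf)  # flush the trailing partial chunk
```

Feeding it the translated token stream yields 2-3 word chunks that can be handed to Kokoro as soon as they form, instead of waiting for the full sentence.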