Been building this for a while and finally cleaned it up enough to share.
voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks through the full real-time pipeline:
- Microphone capture
- Whisper for STT
- Local GGUF LLM (via llama.cpp)
- Kokoro for TTS
- Speaker output
Everything streams - TTS starts speaking before the LLM has finished generating its response. That's the part that makes it feel like a real conversation instead of a chatbot with a voice skin.
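The core trick here (this is my sketch of the general pattern, not the repo's actual code) is buffering the LLM's token stream and flushing sentence-sized chunks to TTS as soon as they're complete. `fake_llm_stream` is a hypothetical stand-in for a real streaming call like llama.cpp's:

```python
import re

def fake_llm_stream():
    # Hypothetical stand-in for a streaming LLM call (e.g. llama.cpp's
    # token stream); yields tokens one at a time.
    for tok in "Sure. The weather today is sunny. Anything else?".split(" "):
        yield tok + " "

def sentence_chunks(token_stream):
    """Buffer streamed tokens; yield sentence-sized chunks for TTS."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Flush on sentence-ending punctuation so TTS can start early,
        # while the LLM is still generating the rest of the reply.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end-of-stream

for chunk in sentence_chunks(fake_llm_stream()):
    print(chunk)  # in the real pipeline, each chunk goes straight to TTS
```

Chunking on sentence boundaries is a common compromise: smaller chunks mean earlier first audio, but TTS prosody tends to fall apart below the sentence level.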
Each chapter is a runnable script + a short CODE.md walkthrough. There's also a small shared library so you can see how the pieces compose into a real system, not just isolated calls.
Why fully local matters here: you can actually see where latency lives. Warm-up, first-audio time, streaming chunk size - these aren't abstractions when you're running it on your own machine.
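For first-audio time specifically, the measurement is just a stopwatch around the first chunk out of the synthesis stream. A minimal sketch, with `fake_tts_stream` as a hypothetical stand-in for a streaming TTS engine:

```python
import time

def fake_tts_stream():
    # Hypothetical stand-in for a streaming TTS engine: emits PCM
    # chunks with a small synthesis delay before each one.
    for _ in range(3):
        time.sleep(0.05)
        yield b"\x00" * 1024  # fake audio chunk

def time_to_first_audio(chunk_stream):
    """Time from request start until the first audio chunk arrives -
    the latency the user actually perceives as 'it started talking'."""
    start = time.perf_counter()
    first_chunk = next(chunk_stream)
    return time.perf_counter() - start, first_chunk

latency, _ = time_to_first_audio(fake_tts_stream())
print(f"first-audio latency: {latency * 1000:.0f} ms")
```

The same pattern applies at every stage boundary (STT done, first LLM token, first TTS chunk), which is how you find out where the wait actually lives.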
Repo: https://github.com/pguso/voice-agents-from-scratch
Happy to answer questions about the architecture or tradeoffs I ran into.