Built a "Voice Agents from Scratch" GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys

Been building this for a while and finally cleaned it up enough to share.

voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline:

  • Microphone capture
  • Whisper for STT
  • Local GGUF LLM (via llama.cpp)
  • Kokoro for TTS
  • Speaker output
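
The stages above can be sketched as a chain of queue-connected workers. This is only an illustrative shape, not the repo's actual API; the stage functions here are placeholders where the real code would wire in Whisper, llama.cpp, and Kokoro:

```python
import queue
import threading

def run_stage(fn, inbox, outbox):
    """Pull items from inbox, transform them, push to outbox.
    A None item signals shutdown and is forwarded downstream."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)

# Placeholder stage functions (hypothetical; the real repo runs the models here).
stt = lambda audio: f"text({audio})"    # Whisper would go here
llm = lambda text: f"reply({text})"     # llama.cpp GGUF model would go here
tts = lambda text: f"audio({text})"     # Kokoro would go here

mic_q, text_q, reply_q, audio_q = (queue.Queue() for _ in range(4))
for fn, i, o in [(stt, mic_q, text_q), (llm, text_q, reply_q), (tts, reply_q, audio_q)]:
    threading.Thread(target=run_stage, args=(fn, i, o), daemon=True).start()

# Feed one "mic frame" through the whole pipeline, then shut down.
mic_q.put("frame1")
mic_q.put(None)
out = []
while (item := audio_q.get()) is not None:
    out.append(item)
# out == ['audio(reply(text(frame1)))']
```

Running each stage on its own thread is what lets downstream stages start working before upstream stages finish, which is the whole point of a streaming pipeline.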

Everything streams: you don't wait for the full LLM response before TTS starts speaking. That's the part that makes it feel like a real conversation instead of a chatbot with a voice skin.
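
One common way to get this effect (a sketch, not necessarily how the repo does it) is to buffer the LLM's token stream and flush a chunk to TTS at each sentence boundary, so the first sentence can be spoken while later ones are still generating:

```python
import re

def sentence_chunks(token_stream):
    """Accumulate streamed LLM tokens and yield complete sentences
    so TTS can start speaking before the full response is done."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Flush everything up to a sentence-ending punctuation mark.
        while (m := re.search(r"[.!?]\s", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():          # flush whatever trails the last terminator
        yield buf.strip()

# Simulated token stream from a streaming LLM API.
tokens = ["Hi", " there", ". ", "How", " can", " I", " help", "? "]
chunks = list(sentence_chunks(tokens))
# chunks == ['Hi there.', 'How can I help?']
```

Chunk size is a tradeoff: smaller chunks lower time-to-first-audio but give the TTS less context for natural prosody.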

Each chapter is a runnable script + a short CODE.md walkthrough. There's also a small shared library so you can see how the pieces compose into a real system, not just isolated calls.

Why fully local matters here: you can actually see where latency lives. Warm-up, first-audio time, streaming chunk size - these aren't abstractions when you're running it on your own machine.
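
Measuring that latency is straightforward once everything runs locally. A minimal time-to-first-chunk probe (the stream here is a fake stand-in, not the repo's code):

```python
import time

def time_to_first_chunk(stream):
    """Latency from request start to the first streamed chunk,
    a proxy for time-to-first-audio."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

def fake_tts_stream():
    # Hypothetical stand-in for model warm-up plus synthesis of the first frame.
    time.sleep(0.05)
    yield b"\x00" * 480   # first audio frame (e.g. 10 ms at 24 kHz, 16-bit mono)
    yield b"\x00" * 480

chunk, latency = time_to_first_chunk(fake_tts_stream())
# latency will be at least the 0.05 s simulated warm-up
```

Running the same probe before and after a warm-up call makes the cold-start cost directly visible.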

Repo: https://github.com/pguso/voice-agents-from-scratch

Happy to answer questions about the architecture or tradeoffs I ran into.

submitted by /u/purellmagents