Unmute: Giving Voice to AI — A Deep Dive into Kyutai’s Framework

Voice is the next frontier for Large Language Models (LLMs). While we have seen the rise of native audio models, they often sacrifice the complex reasoning and tool-calling capabilities of top-tier text LLMs. Enter Unmute by Kyutai.

- By Flow

If you are looking to build a real-time, conversational voice agent without losing the sheer power of models like GPT-4o, Claude, or local open-weights models like Gemma 3, Unmute is the open-source framework you need.

Here is a deep dive into what Unmute is, how its low-latency architecture actually works under the hood, and how you can implement it with your favorite text LLM today.

What is Unmute?

Unmute is an open-source, modular orchestration wrapper developed by the AI lab Kyutai. Instead of building a monolithic AI that processes audio directly (like their previous model, Moshi), Kyutai built Unmute to “wrap” any standard text-based LLM with highly optimized Speech-to-Text (STT) and Text-to-Speech (TTS) engines.

The Architectural Trade-off:

Audio-native models typically boast an end-to-end latency of around 160–200 ms but struggle with tasks like complex tool calling or specific structured formatting. Unmute accepts a slightly higher conversational latency, typically between 400 ms and 750 ms, in exchange for giving you full modular control over the LLM “brain”.

This means your voice assistant can now securely query your company’s SQL database, fetch live weather APIs, or utilize advanced Retrieval-Augmented Generation (RAG) pipelines, all while maintaining a natural, spoken conversation.

How Unmute Works: The Backend Architecture


Unmute achieves its speed by orchestrating asynchronous operations across multiple discrete AI services.

  1. The WebSocket Connection: The user speaks into their browser, and the audio is streamed continuously via a WebSocket protocol heavily inspired by the OpenAI Realtime API format. It streams 24kHz mono audio encoded in the Opus codec.
  2. Kyutai STT & Semantic VAD: The audio hits the Kyutai STT server. Unlike legacy systems that use simple volume thresholds to guess when you stop speaking, Unmute uses Semantic Voice Activity Detection (VAD). The model evaluates the actual linguistic syntax and intonation to predict the probability that you have finished your thought, preventing the AI from interrupting you mid-sentence.
  3. The LLM Generation: The transcribed text is sent to your text LLM of choice.
  4. Streaming-in-Text TTS: As soon as the LLM emits its very first few words, the Unmute backend catches them and pipes them into the kyutai/tts 1.6B model. This TTS engine is capable of streaming-in-text, meaning it begins synthesizing audio before the LLM has even finished generating the sentence.
  5. The QuestManager: Under the hood, all of this chaotic asynchronous Python execution is safely managed by a QuestManager class. It spins up distinct "Quests" (background service lifecycles) for the STT and TTS. If you interrupt the bot by speaking, the QuestManager instantly kills the active TTS generation quest, drops the audio buffer, and starts listening again.
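The interruption behavior in step 5 can be sketched with plain asyncio. Note that this is an illustrative toy, not Unmute's actual code: the real QuestManager has a different interface, and the class and method names below are assumptions made for the example.

```python
import asyncio

class QuestManagerSketch:
    """Toy version of the kill-the-TTS-on-barge-in pattern."""

    def __init__(self):
        self.active_tts = None  # the currently running TTS "quest"

    async def speak(self, synthesize):
        # Spin up a background task (a "quest") for TTS playback.
        self.active_tts = asyncio.create_task(synthesize())
        try:
            await self.active_tts
        except asyncio.CancelledError:
            pass  # quest killed: the pending audio buffer is simply dropped

    def interrupt(self):
        # The user started speaking: kill the active TTS quest immediately.
        if self.active_tts is not None and not self.active_tts.done():
            self.active_tts.cancel()

async def demo():
    spoken = []

    async def synthesize():
        # Stand-in for streaming TTS: emit one audio chunk per text chunk.
        for chunk in ["Hello", " there", ", how", " can I help?"]:
            spoken.append(chunk)
            await asyncio.sleep(0.2)

    qm = QuestManagerSketch()
    playback = asyncio.create_task(qm.speak(synthesize))
    await asyncio.sleep(0.5)  # the user barges in mid-utterance
    qm.interrupt()
    await playback
    return spoken

print(asyncio.run(demo()))  # only the chunks emitted before the interruption
```

Because cancellation propagates through the awaited task, the bot stops speaking within one event-loop tick of the user barging in, rather than finishing its sentence.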

How to Implement Unmute with Any Text LLM


One of Unmute’s greatest strengths is that it communicates with the LLM via standard OpenAI-compatible API structures. You do not need to rewrite the core backend logic to swap out the “brain”; you simply update the environment variables.

By default, Unmute routes to a local instance (like vLLM) or OpenRouter. Here is how you can reconfigure it to point to different LLMs by modifying your docker-compose.yml or setting bash variables.
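Since the backend speaks the standard OpenAI chat-completions protocol, "swapping the brain" really is just a base-URL change. As a rough illustration of the contract any candidate server must honor (this is not Unmute's internal code, and the default values are only examples), the request looks like this:

```python
import json
import os

def build_chat_request(user_text):
    # A standard OpenAI-compatible chat-completions payload, which any
    # candidate "brain" (vLLM, Ollama, OpenAI, Mistral, ...) must accept
    # at its /chat/completions endpoint.
    return {
        "model": os.environ.get("KYUTAI_LLM_MODEL", "gpt-4o"),
        "messages": [
            {"role": "system", "content": "You are a helpful voice assistant."},
            {"role": "user", "content": user_text},
        ],
        # Streaming is essential: it lets the TTS start speaking on the
        # LLM's first few tokens instead of waiting for the full reply.
        "stream": True,
    }

print(json.dumps(build_chat_request("What's the weather like?"), indent=2))
```

Any endpoint that accepts this shape, authenticated with whatever `KYUTAI_LLM_API_KEY` holds, can drop in as the brain.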

Option 1: Using OpenAI (GPT-4o)

To use a proprietary model, you just need to point the backend to the OpenAI API endpoint.

Update your docker-compose.yml under the backend service:

backend:
  image: unmute-backend:latest
  environment:
    - KYUTAI_STT_URL=ws://stt:8080
    - KYUTAI_TTS_URL=ws://tts:8080
    - KYUTAI_LLM_URL=https://api.openai.com/v1
    - KYUTAI_LLM_MODEL=gpt-4o
    - KYUTAI_LLM_API_KEY=sk-your-openai-api-key

Option 2: Using Local Models with Ollama

If you want total privacy, you can run a model like gemma3 or llama3 locally via Ollama.

First, ensure you have pulled the model locally: ollama pull gemma3.
Next, configure the environment variables. Since Ollama runs on the host, you must ensure the Docker container can reach your host machine's network.

backend:
  image: unmute-backend:latest
  extra_hosts:
    - "host.docker.internal:host-gateway"
  environment:
    - KYUTAI_STT_URL=ws://stt:8080
    - KYUTAI_TTS_URL=ws://tts:8080
    - KYUTAI_LLM_URL=http://host.docker.internal:11434/v1
    - KYUTAI_LLM_MODEL=gemma3
    - KYUTAI_LLM_API_KEY=ollama # Ollama accepts any string here

Option 3: Running Without Docker

If you are developing locally and running the backend directly via Python scripts (like ./dockerless/start_backend.sh), simply export the variables in your terminal before launching:

export KYUTAI_LLM_URL=https://api.mistral.ai/v1
export KYUTAI_LLM_MODEL=mistral-small-latest
export KYUTAI_LLM_API_KEY=your-mistral-api-key
./dockerless/start_backend.sh

Handling Function Calling & Tools

If your LLM supports tool calling, Unmute can handle it via a proxy design: you wrap your LLM server in a proxy that exposes an OpenAI-compatible endpoint. When the LLM decides to trigger a tool, the proxy intercepts the request, executes the code, and silently streams a “stalling” text string (e.g., “Let me check that for you…”) back to the Unmute TTS to speak while the data is being fetched. The final answer is then generated and spoken aloud.
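The stalling-proxy idea can be sketched as a generator that yields text deltas for the TTS while tools run in between. The event shapes, function names, and the fake stream below are illustrative assumptions for the sketch, not Unmute's or OpenAI's actual wire format:

```python
def proxied_stream(llm_events, tools, stall_text="Let me check that for you... "):
    """Yield text deltas for the TTS to speak, transparently running tools."""
    for event in llm_events:
        if event["type"] == "text":
            yield event["delta"]
        elif event["type"] == "tool_call":
            # Speak a stalling phrase so the conversation never goes silent
            # while the tool executes.
            yield stall_text
            result = tools[event["name"]](**event["arguments"])
            # A real proxy would append `result` to the message history and
            # call the LLM again for its final, spoken answer.
            yield f"It looks like it's {result}."

def get_weather(city):
    return f"sunny in {city}"  # stand-in for a live weather API

fake_events = [
    {"type": "tool_call", "name": "get_weather",
     "arguments": {"city": "Paris"}},
]
print("".join(proxied_stream(fake_events, {"get_weather": get_weather})))
# → Let me check that for you... It looks like it's sunny in Paris.
```

Because the proxy still looks like a plain OpenAI-compatible text stream from Unmute's side, the backend needs no changes to gain tool use.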

Conclusion

The Kyutai Unmute framework successfully solves the “text bottleneck” problem for developers wanting to build voice AI. By leveraging highly optimized Rust-based streaming servers for acoustic tokenization alongside asynchronous Python orchestration, Unmute gives any LLM a responsive, natural-sounding voice.

Whether you want to build an AI tutor with the reasoning of GPT-4o, or a fully private local assistant powered by Llama 3 via Ollama, the infrastructure is now entirely open-source and ready to deploy.


Unmute: Giving Voice to AI — A Deep Dive into Kyutai’s Framework was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
