By Muhammad Ali Nasir — April 2026

I got tired of manually deciding which local LLM to use.
Coding question? Load Qwen. Math problem? Switch to a reasoning model. General question? Back to something smaller and faster. Every switch meant waiting 30–60 seconds for the model to load into VRAM. It was exhausting.
So I built a router that does it automatically — classifies every query in under 5ms and routes it to the optimal model based on benchmark scores, historical performance, and user feedback.
This is the story of how it works, what didn’t work, and what surprised me most.
The Problem With Local LLMs
Running LLMs locally is great — zero API costs, full privacy, complete control. But the operational reality is painful:
- Consumer GPUs (8–48GB VRAM) can only hold one model at a time
- No existing local inference tool (Ollama, LM Studio, llama.cpp) has query-aware routing
- You either use one model for everything and accept mediocre results, or manually switch models and accept the friction
The insight that started this project: different models are dramatically better at different task types. A 7B coding-optimized model beats a 32B general model on HumanEval. A reasoning-focused model trained on math competitions outperforms a larger instruction-tuned model on GSM8K.
If you could automatically detect what type of task a query is and route to the right model, you’d get better results than just always using your “best” model — with no manual switching.
The Routing Architecture
The routing pipeline has four stages:
Query → Classify Task → Score All Models → Select Best → Load & Serve
Let me walk through each one.
Stage 1: Task Classification in <5ms
The first question was: how do you classify a query fast enough that it doesn’t add noticeable latency?
My first instinct was to use a small neural classifier — a fine-tuned BERT or a sentence-transformers model. I tried this. The accuracy was good (~88%) but the latency was 40–120ms. That’s unacceptable for a routing layer that fires on every single request.
The solution: TF-IDF + Logistic Regression.
I know — it sounds like 2015. But hear me out.
TF-IDF vectorization with bigrams (max 5,000 features) + a Logistic Regression classifier with balanced class weights achieves:
- ~85% accuracy on 6 task categories
- <5ms inference (usually 1–3ms in practice)
- ~1MB RAM footprint
- Loads from a joblib pickle in milliseconds at startup
The 6 task categories map directly to benchmark strengths:
| Task Type | Benchmark Signal | Example Query |
| --- | --- | --- |
| coding | HumanEval | "Write a Python function to..." |
| math | GSM8K | "If a train leaves at 3pm..." |
| reasoning | MMLU-Pro | "Explain why the Roman Empire..." |
| instruction | MT-Bench | "Write a cover letter for..." |
| hard_reasoning | GPQA | "Derive the Schrödinger equation..." |
| general | MMLU-Pro | Catch-all fallback |
One important detail: if classifier confidence drops below 0.6, the system falls back to general routing rather than making a low-confidence routing decision. Better to be safe than to confidently route wrong.
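For concreteness, here is roughly what that pipeline looks like in scikit-learn. This is a minimal sketch of the approach described above (bigram TF-IDF, 5,000 features, balanced class weights, a 0.6 confidence floor); the toy training data is illustrative only, not the real labeled set.

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training examples, one per task type; the real classifier is trained on a
# much larger labeled set.
train_queries = [
    "Write a Python function to reverse a linked list",
    "If a train leaves at 3pm traveling 60 mph, when does it arrive?",
    "Explain why the Roman Empire split in two",
    "Write a cover letter for a data analyst role",
    "Derive the Schrödinger equation from first principles",
    "What is the capital of Australia?",
]
train_labels = ["coding", "math", "reasoning", "instruction", "hard_reasoning", "general"]

# Bigram TF-IDF (5,000 features) + balanced Logistic Regression.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
clf.fit(train_queries, train_labels)
joblib.dump(clf, "task_classifier.joblib")  # reloads in milliseconds at startup

def classify(query: str) -> str:
    probs = clf.predict_proba([query])[0]
    best = probs.argmax()
    # Below the 0.6 confidence floor, fall back to general routing.
    return clf.classes_[best] if probs[best] >= 0.6 else "general"
```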
Stage 2: Multi-Signal Scoring
Once we know the task type, we score every registered model:
final_score = 0.4 × benchmark_score
+ 0.3 × memory_score
+ 0.15 × latency_score
+ 0.15 × feedback_score
Benchmark score (40%): Normalized benchmark result for the classified task type. A model with 72% HumanEval gets a 0.72 for coding queries. These are pulled from HuggingFace model cards, the Open LLM Leaderboard, or local mini-evaluations for models without published scores.
Memory score (30%): This one surprised me most — more on this below.
Latency score (15%): Normalized inverse of historical average latency. 1 - (avg_latency_ms / 10000). A model that typically responds in 2s scores higher than one that takes 8s.
Feedback score (15%): Thumbs up/down from users. Simple ratio: thumbs_up / total_feedback. Low weight because feedback is sparse early on, but compounds meaningfully over time.
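In code, the scoring step is just a weighted sum over normalized signals. A sketch with the default weights; the model dict shape and the neutral 0.5 default for missing signals are my assumptions, not LocalForge internals:

```python
# Default weights (configurable via environment variables in the real system).
WEIGHTS = {"benchmark": 0.40, "memory": 0.30, "latency": 0.15, "feedback": 0.15}

def score_model(model: dict, task_type: str) -> float:
    # All signals are normalized to [0, 1] before weighting.
    benchmark = model["benchmarks"].get(task_type, 0.0)       # e.g. 0.72 for 72% HumanEval
    memory = model.get("memory_score", 0.5)                   # decay-weighted (Stage 3)
    latency = max(0.0, 1 - model.get("avg_latency_ms", 5000) / 10000)
    votes = model.get("feedback", {"up": 0, "total": 0})
    feedback = votes["up"] / votes["total"] if votes["total"] else 0.5  # neutral until votes arrive
    return (WEIGHTS["benchmark"] * benchmark
            + WEIGHTS["memory"] * memory
            + WEIGHTS["latency"] * latency
            + WEIGHTS["feedback"] * feedback)
```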
Stage 3: The Memory Layer (The Part That Surprised Me)
This is the part I’m most proud of and the part I didn’t expect to matter as much as it does.
The memory layer stores embeddings of every processed query in Qdrant alongside the outcome:
{
"query_text": "implement binary search in Python",
"task_type": "coding",
"model_used": "Qwen2.5-7B-Coder:Q4_K_M",
"outcome": "success",
"latency_ms": 2340,
"timestamp": "2026-04-22T12:00:00Z"
}

When a new query arrives, we embed it with nomic-embed-text-v1.5 (768-dim, cosine similarity) and retrieve the most similar historical queries. This tells us: "for queries semantically similar to this one, which models have historically succeeded?"
The memory score uses exponential decay to weight recent interactions more heavily:
score = Σ(outcome_i × λ^days_since_i × similarity_i) / Σ(λ^days_since_i × similarity_i)
Where λ = 0.95. An interaction from 14 days ago is weighted at ~49% of a fresh one.
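In code, the decay-weighted score is only a few lines. A sketch, assuming each retrieved hit carries its outcome (1.0 for success, 0.0 for failure), its cosine similarity, and its age in days:

```python
def memory_score(hits: list[dict], decay: float = 0.95) -> float:
    """Similarity- and recency-weighted success rate for one candidate model."""
    num = den = 0.0
    for hit in hits:
        weight = (decay ** hit["days_since"]) * hit["similarity"]
        num += hit["outcome"] * weight   # outcome: 1.0 success, 0.0 failure
        den += weight
    return num / den if den else 0.0

# Example: a success from today and a failure from two weeks ago.
print(memory_score([
    {"outcome": 1.0, "days_since": 0, "similarity": 0.91},
    {"outcome": 0.0, "days_since": 14, "similarity": 0.88},
]))  # ≈ 0.68: the two-week-old failure is discounted to ~49% weight
```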
What surprised me: Once the system has processed 50–100 queries, the memory signal starts dominating the routing decisions in a meaningful way. The benchmark scores tell you which model is generally better at coding. The memory layer tells you which model is better at your specific type of coding queries — which turns out to be different.
A user who mostly asks about async Python patterns will see the memory layer learn that Model A is consistently better for their queries specifically, even if Model B has higher overall HumanEval scores.
One important optimization: Instead of querying Qdrant once per candidate model (N queries for N models), I do a single search requesting top_k × num_models results and group client-side. This keeps the memory lookup to a single vector search regardless of how many models are registered.
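Here is what that single-search pattern looks like with the Qdrant Python client; the collection and payload field names are illustrative rather than LocalForge's exact schema:

```python
from collections import defaultdict
from qdrant_client import QdrantClient

client = QdrantClient(path="./qdrant_data")  # disk-persisted local mode, no server needed

def similar_history(query_vector, num_models: int, top_k: int = 5) -> dict:
    # One vector search sized for all candidates, instead of one search per model.
    hits = client.search(
        collection_name="query_memory",
        query_vector=query_vector,
        limit=top_k * num_models,
    )
    by_model = defaultdict(list)
    for hit in hits:  # group client-side by the model that handled each past query
        by_model[hit.payload["model_used"]].append({
            "outcome": 1.0 if hit.payload["outcome"] == "success" else 0.0,
            "similarity": hit.score,
        })
    return by_model
```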
Stage 4: Fallback Evidence
There’s a cold-start problem: new models have no memory data. And there’s a confidence problem: sometimes the classifier is uncertain.
When routing confidence is below 0.3, the system checks for fallback evidence — any model that has successfully handled ≥2 similar queries with a success rate above 60%. This prevents the router from making blind decisions on unfamiliar query types.
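The check itself is a small filter over the grouped history. A sketch with the thresholds from above (at least two similar queries, success rate above 60%):

```python
def fallback_candidates(by_model: dict, min_hits: int = 2, min_success: float = 0.6) -> list[str]:
    """Models with enough evidence on similar queries to trust despite low routing confidence."""
    candidates = []
    for model, hits in by_model.items():
        if len(hits) >= min_hits:
            success_rate = sum(h["outcome"] for h in hits) / len(hits)
            if success_rate > min_success:
                candidates.append(model)
    return candidates
```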
What Didn’t Work
Attempt 1 — Embedding-based classification. I tried using cosine similarity to a set of “prototype” embeddings per task type. Fast, but accuracy was poor (~70%) because the embedding space doesn’t cleanly separate task types at query length.
Attempt 2 — Always routing to the highest benchmark model. Before building the memory layer, I tested a simpler router that just always picked the model with the highest benchmark score for the task type. This worked well for clear-cut queries but failed badly on ambiguous ones and ignored the latency cost of frequently loading a large model.
Attempt 3 — Equal weights across all signals. Early versions used 0.25 weights across all four signals. Benchmark score deserves higher weight early (when memory is sparse), so I settled on 0.4/0.3/0.15/0.15 after testing. These weights are configurable via environment variables.
The Full System
The router is one component of LocalForge — a self-hosted AI control plane I built that handles the entire local LLM lifecycle:
- Model management — Browse and download GGUF models from HuggingFace with VRAM-aware filtering
- Inference serving — OpenAI-compatible /v1/chat/completions endpoint (just change your base URL; a client sketch follows at the end of this section)
- Benchmarking — Automated score fetching from HF model cards + local mini-evaluation
- LoRA finetuning — QLoRA training pipeline with live loss streaming via SSE
- RAG knowledge base — Document ingestion and retrieval with LlamaIndex + Qdrant
- Dashboard — Real-time hardware monitoring, routing traces, memory statistics
Everything runs locally. No Docker required. No cloud dependencies. SQLite for relational data, Qdrant in disk-persisted mode for vectors.
Stack: FastAPI · Next.js 16 · llama.cpp · Qdrant · LlamaIndex · PEFT · TRL · scikit-learn
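Since the serving endpoint speaks the OpenAI API, the standard OpenAI Python SDK works against it unchanged. A minimal sketch, assuming the backend runs on localhost:8010 as in the quick-start below; the model value is an illustrative placeholder, since the router decides which local model actually serves the request:

```python
from openai import OpenAI

# Assumes the LocalForge backend is running locally on port 8010 (see "Try It" below).
client = OpenAI(base_url="http://localhost:8010/v1", api_key="local")

resp = client.chat.completions.create(
    model="auto",  # illustrative placeholder; the router picks the actual local model
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists"}],
)
print(resp.choices[0].message.content)
```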
Results
On my RTX 6000 Ada (48GB VRAM) running a mix of Qwen2.5-7B, Mistral-7B, and a coding-optimized model:
- Classifier latency: 1–3ms (99th percentile <5ms)
- Routing overhead (full pipeline): <10ms additional latency
- Task classification accuracy: ~85% on held-out test set
- After 200 queries, memory-enhanced routing selects a different model than benchmark-only routing in ~30% of cases — and users rate those responses higher
What I’d Do Differently
Collect labeled routing data from day one. I built the classifier on synthetic data. Real query logs with human-labeled task types would significantly improve accuracy, especially on edge cases like “explain this code” (coding or reasoning?).
Add model-level confidence. Some models are good at most things but great at one thing. The current scoring treats benchmark scores as fixed, but a model’s effective score should depend on query difficulty, not just task type.
Implement adaptive weight tuning. The 0.4/0.3/0.15/0.15 weights are manually tuned. A simple bandit algorithm that adjusts weights based on feedback outcomes would be more principled.
Try It
LocalForge is open source under MIT license.
git clone https://github.com/al1-nasir/LocalForge.git
cd LocalForge/backend
pip install -r requirements.txt
uvicorn app.main:app --port 8010
If you’re building anything in the local LLM space or have thoughts on the routing approach — I’d genuinely love to hear it. What routing strategies are you using?
Muhammad Ali Nasir is an ML Engineer and final-year CS student at PIEAS, Islamabad. He builds production AI systems at alinasir.me