The retrieval step is solved. The assembly step is where production RAG actually fails.

There’s a moment every team hits after their RAG pipeline goes live.
The vector search is tuned. Embedding quality is solid. Top-k retrieval looks reasonable in testing. Then a real user asks a real question, and the LLM confidently returns something wrong, not hallucinated from nothing, but wrong in an almost worse way: the right documents were retrieved, and the answer is still bad.
This is not a retrieval problem. It’s a context assembly problem. And most engineers building RAG systems have spent almost no time thinking about it.
I’ve built RAG pipelines that connect LLM systems to enterprise knowledge bases and operational datasets at scale. The retrieval plumbing is the easy part. What breaks in production consistently is what happens between retrieval and generation: how you rank, filter, and assemble context before the model ever sees it.
Here’s what’s actually going wrong, and how to fix it like an engineer.
The Retrieval-Generation Gap Nobody Talks About
Standard RAG architecture looks clean on a whiteboard:
- User query → embed → vector search → top-k chunks
- Top-k chunks → stuff into prompt → LLM → answer
The problem is step 2. “Stuff into prompt” is doing an enormous amount of work that teams treat as trivial. It isn’t.
Vector similarity retrieves semantically related content. It does not retrieve contextually prioritized content. Those are different operations. Cosine similarity has no opinion on whether a retrieved chunk is from an outdated policy document versus the current one. It has no opinion on whether two of your top-5 chunks are near-duplicates. It has no opinion on whether the most relevant sentence is buried in chunk 4 of 5, where the LLM’s attention will degrade.
You’re handing the model a pile of documents and asking it to figure out the priority ordering. It often gets this wrong — and you won’t know until a user notices.
Failure Mode 1: Retrieval Without Reranking
Vector search gives you candidates, not answers. The distinction matters enormously in production.
A bi-encoder (the standard embedding model used in vector search) is optimized for speed at scale. It produces a single vector per chunk ahead of time, so query-time scoring is just a cheap comparison against a prebuilt index. That’s why it scales. But speed comes at a cost: the bi-encoder never directly compares the query against each candidate in context. It compares compressed representations.
A cross-encoder (reranker) does something different. It takes the query and a candidate chunk together, runs them through a transformer jointly, and produces a relevance score based on their direct interaction. It’s significantly slower — you can’t run it across your full corpus — but it’s far more accurate on the shortlist.
The production pattern that actually works:
User Query
↓
Bi-encoder vector search → top-50 candidates (fast, approximate)
↓
Cross-encoder reranker → top-5 candidates (slow, precise)
↓
Context assembly → LLM
You use retrieval to narrow the candidate pool. You use reranking to establish true relevance order. Most teams skip the reranking layer entirely because it wasn’t in the tutorial they followed.
In practice, reranking alone will fix a meaningful percentage of your bad outputs. Not because retrieval was returning garbage but because it was returning the right documents in the wrong order, and your LLM was reading the wrong chunk first.
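As a concrete sketch, here is what the two-stage pattern looks like with the sentence-transformers library. The model names are placeholders for whatever bi-encoder and cross-encoder you actually run, and in production the first stage would be your vector database rather than an in-memory scan:
# Retrieve-then-rerank sketch. Model names are illustrative, not recommendations.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")                   # fast, approximate
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # slow, precise

def retrieve_and_rerank(query: str, chunks: list[str], top_n: int = 50, final_k: int = 5):
    # Stage 1: bi-encoder similarity (your vector DB does this in production).
    chunk_vecs = bi_encoder.encode(chunks)
    query_vec = bi_encoder.encode([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    candidate_ids = np.argsort(-sims)[:top_n]

    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly.
    pairs = [(query, chunks[i]) for i in candidate_ids]
    scores = cross_encoder.predict(pairs)
    reranked = sorted(zip(candidate_ids, scores), key=lambda x: -x[1])
    return [(chunks[i], float(s)) for i, s in reranked[:final_k]]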
Failure Mode 2: Query-Document Mismatch at the Semantic Layer
There’s a structural problem with how most teams embed queries versus documents.
Your documents were written to answer questions nobody has asked yet. Your user’s query is a short, often ambiguous natural language string. These two representations live in different regions of embedding space, not because the content is unrelated, but because the linguistic form is completely different.
A user typing “what’s the policy on remote work reimbursement” and a policy document titled “Employee Expense Guidelines — Section 4: Remote Work Stipends” should match. Cosine similarity will often disagree.
Two patterns that address this in production:
Query expansion. Before embedding the user’s query, generate 2–3 paraphrases or sub-questions using a lightweight LLM call. Retrieve against all of them, union the results, deduplicate, then rerank. This dramatically increases recall on queries where the user’s phrasing doesn’t match your document vocabulary.
Hypothetical Document Embeddings (HyDE). Instead of embedding the query directly, use an LLM to generate a hypothetical answer to the query, a plausible document that would answer it. Embed that instead. The resulting vector lives in document-space, not query-space, which means retrieval finds semantically similar documents much more reliably.
HyDE feels strange when you first encounter it. You’re generating a fake answer to find real answers. But the underlying logic is sound: you’re solving a representation mismatch problem by translating the query into the same linguistic space as your corpus.
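A minimal sketch of how the two patterns compose, assuming you already have an LLM completion call, an embedding function, and a vector search helper. The names complete, embed, and vector_search below are placeholders for whatever you use, not any specific library’s API:
def expand_query(query: str, complete) -> list[str]:
    # Query expansion: cheap paraphrases to cover vocabulary the user didn't use.
    prompt = ("Rewrite the following question in 3 different ways, one per line, "
              f"using different vocabulary:\n{query}")
    paraphrases = [line.strip() for line in complete(prompt).splitlines() if line.strip()]
    return [query] + paraphrases[:3]

def hyde_query(query: str, complete) -> str:
    # HyDE: generate a plausible document that would answer the query, and embed that.
    return complete(f"Write a short passage that directly answers this question:\n{query}")

def retrieve_with_expansion(query: str, complete, embed, vector_search, top_k: int = 50):
    candidates = {}
    for q in expand_query(query, complete) + [hyde_query(query, complete)]:
        for chunk_id, score in vector_search(embed(q), top_k=top_k):
            # Union the result sets, keeping the best score seen per chunk.
            candidates[chunk_id] = max(score, candidates.get(chunk_id, float("-inf")))
    # Hand the unioned candidate set to the reranker from the previous section.
    return sorted(candidates.items(), key=lambda x: -x[1])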
Failure Mode 3: Context Stuffing Without a Position Strategy
Context window position is not neutral. Research has consistently shown that LLM attention is not uniformly distributed across a context window. Models tend to attend more strongly to content at the beginning and end of the context, with degraded attention to content in the middle. This is often called the “lost in the middle” problem.
If your most relevant chunk lands in position 3 of 5, you’re leaving performance on the table.
In production, context assembly should be deliberate, not sequential. The naive approach puts chunks in retrieval-score order: rank 1 first, rank 5 last. A better approach structures the context with the highest-relevance content at both the beginning and end, with supporting or supplementary context in the middle.
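One way to implement that ordering, assuming the chunks arrive sorted best-first by reranker score:
def position_aware_order(chunks_by_rank: list[str]) -> list[str]:
    # Alternate chunks between the front and back of the context so the
    # weakest material lands in the middle, where attention degrades most.
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks [1, 2, 3, 4, 5] come out as [1, 3, 5, 4, 2]:
# rank 1 opens the context, rank 2 closes it, rank 5 sits in the middle.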
Beyond position, you also need a filter step before assembly:
- Deduplication. Embedding-based retrieval will often return near-duplicate chunks from the same source document. Stuffing all of them into context wastes tokens and confuses the model. Before assembly, cluster retrieved chunks by semantic similarity and keep one representative per cluster (see the sketch after this list).
- Temporal filtering. If your knowledge base has versioned documents, vector similarity will happily retrieve outdated content that scores higher than current content on semantic grounds. You need metadata filters: not optional, not aspirational, enforced at query time.
- Chunk boundary quality. If your chunking strategy splits a paragraph at a boundary that severs critical context, no amount of retrieval quality saves you. The model gets an incomplete thought. Audit your chunk boundaries. Overlapping chunks (a 20–30% overlap is a reasonable starting point) reduce this failure significantly.
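Here is the deduplication sketch referenced above. It uses embedding similarity as the clustering signal; the 0.9 threshold is a starting point to tune against your own corpus, not a recommendation:
import numpy as np

def deduplicate(chunks: list[str], embeddings: np.ndarray, threshold: float = 0.9) -> list[str]:
    # Greedy near-duplicate removal. Pass chunks in reranked order so the
    # representative kept per cluster is the highest-scoring one.
    # Assumes embeddings are L2-normalized, so a dot product is cosine similarity.
    kept_texts, kept_vecs = [], []
    for text, vec in zip(chunks, embeddings):
        if all(float(vec @ kept) < threshold for kept in kept_vecs):
            kept_texts.append(text)
            kept_vecs.append(vec)
    return kept_texts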
Failure Mode 4: No Query Routing, No Fallback
Not every user query is the same type of question. Treating them all identically — embed → retrieve → generate — will fail at scale.
Some queries are factual lookups. Some are comparative. Some require synthesis across multiple documents. Some are procedural. Some are ambiguous or malformed. A single retrieval strategy doesn’t handle all of these well.
In a production system, query routing is what separates a demo from something that actually works for real users:
def route_query(query: str) -> RetrievalStrategy:
    query_type = classify_query(query)  # fast LLM call or classifier
    if query_type == "factual_lookup":
        return SingleChunkRetrieval(top_k=3)
    elif query_type == "comparative":
        return MultiDocRetrieval(top_k=10, rerank=True)
    elif query_type == "procedural":
        return SequentialChunkRetrieval(preserve_order=True)
    elif query_type == "ambiguous":
        return ClarificationFlow()
    else:
        return DefaultRetrieval(top_k=5, rerank=True)
This doesn’t have to be complex. A lightweight classifier or a fast LLM call to determine query intent, routed to different retrieval configurations, will outperform any single-strategy approach on a diverse real-world query distribution.
Equally important: build a fallback. When retrieval confidence is low, when your reranker scores are below the threshold, when retrieved chunks have low relevance, don’t generate. Return a low-confidence signal, request clarification, or surface the gap explicitly. Confident answers built on weak retrieval are how RAG systems destroy user trust.
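A minimal version of that gate, assuming the reranker scores from earlier. The thresholds here are placeholders; score scales differ across reranker models, so calibrate them against your own eval set:
from dataclasses import dataclass

@dataclass
class AssemblyResult:
    chunks: list[str]
    confident: bool

def confidence_gate(reranked: list[tuple[str, float]],
                    min_score: float = 0.3, min_chunks: int = 2) -> AssemblyResult:
    # Keep only chunks the reranker actually believes in.
    strong = [chunk for chunk, score in reranked if score >= min_score]
    if len(strong) < min_chunks:
        # Don't generate: ask a clarifying question or surface the gap instead.
        return AssemblyResult(chunks=strong, confident=False)
    return AssemblyResult(chunks=strong, confident=True)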
What a Production-Grade Context Assembly Layer Looks Like
Putting this together, here’s the architecture that actually holds up in production:
User Query
↓
[Query Routing] — classify intent, select strategy
↓
[Query Expansion / HyDE] — improve semantic coverage
↓
[Bi-encoder Retrieval] — top-50 candidates
↓
[Metadata Filtering] — temporal, permission, domain filters
↓
[Cross-encoder Reranking] — top-5 to 10 precise candidates
↓
[Deduplication] — cluster and deduplicate near-identical chunks
↓
[Position-Aware Assembly] — place highest relevance at beginning/end
↓
[Confidence Check] — if below threshold, route to fallback
↓
[LLM Generation]
Every step between retrieval and generation is an opportunity to either improve your answer quality or silently degrade it. Most teams implement none of these steps and then wonder why their RAG system underperforms in production.
The engineering work isn’t glamorous. There’s no new model architecture in this list. But this is the layer that makes the difference between a system that impresses in demos and one that your users actually trust.
The Real Problem
Most RAG failures are framed as retrieval failures because retrieval is the step teams can see. You log what was retrieved. You know what the model generated. The steps in between, how context was assembled, filtered, and positioned, are usually invisible.
That invisibility is the problem. If you’re not observing the context assembly layer, you’re not debugging the right thing.
Start measuring what you’re putting in the prompt, not just what you’re getting out. Log chunk positions, reranker scores, deduplication decisions, and query routing outcomes. Build evals that test context quality independently of generation quality. Make the assembly layer a first-class engineering concern.
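Even a single structured log record per request goes a long way. A sketch of what is worth capturing; the field names are illustrative, not a schema you have to adopt:
import json
import logging

logger = logging.getLogger("context_assembly")

def log_assembly(query_id: str, route: str, reranker_scores: list[float],
                 context_chunk_ids: list[str], dropped_duplicates: int,
                 confident: bool) -> None:
    # One record per request: enough to reconstruct what the model saw and why.
    logger.info(json.dumps({
        "query_id": query_id,
        "route": route,                            # strategy chosen by the router
        "reranker_scores": [round(s, 4) for s in reranker_scores],
        "context_chunk_ids": context_chunk_ids,    # in final prompt order
        "dropped_duplicates": dropped_duplicates,
        "confident": confident,
    }))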
The retrieval step didn’t break your pipeline. The assembly step did; you just couldn’t see it.
I’m a Senior AI Engineer at MasTec, where I architect production LLM agent systems and RAG pipelines connecting AI to enterprise knowledge bases and operational datasets. I write about what actually breaks when you ship AI at scale.
IEEE Senior Member | Patent Holder