Why Your RAG Pipeline Breaks in Production (And How to Fix It Like an Engineer)

You shipped the demo. It worked. Then your users started finding the edges.


I’ve been building software long enough to know that “it works on my machine” is a rite of passage, not a finish line. RAG pipelines have their own version of this: they work beautifully on your curated test queries, then quietly fall apart on anything a real user actually types.

The difference between a RAG prototype and a production system isn’t the model. It’s the plumbing. And as engineers, plumbing is exactly what we’re supposed to be good at.

Here are the five failure modes I’ve debugged repeatedly — plus the engineering fixes that actually hold up under load.

First: What You’re Actually Building

A RAG pipeline is a retrieval system bolted to a generation system. That framing matters, because most RAG failures are retrieval failures wearing the costume of model failures.

When your users say “it hallucinated,” the first question isn’t “is the LLM bad?” It’s “did the right chunk even make it into the context?” Nine times out of ten, it didn’t.

Keep that mental model. It’ll save you hours of prompt-tweaking rabbit holes.

Failure 1: Your Text Splitter Is Lying to You

Fixed-size chunking — splitting by token count or paragraph — is the default in every RAG tutorial. It’s also where most retrieval quality goes to die.

Here’s the bug: your splitter doesn’t know that “…except for international orders, which follow a different policy” is semantically inseparable from the sentence it’s completing. It cuts at 512 tokens regardless. The answer to your user’s question now lives across two chunks, and your retriever will confidently return the wrong one.

This is a data pipeline bug, not a model bug. Treat it like one.

The fix — semantic chunking:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # only split on significant semantic shifts
)

docs = splitter.create_documents([raw_text])

Instead of cutting at N tokens, this measures embedding distance between consecutive sentences and only splits where meaning actually changes. It’s slower to index, but your retrieval precision improves dramatically.

Also add a re-ranker after retrieval:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

Embeddings measure vector proximity. A cross-encoder reads the query and each chunk together — much closer to how a human would judge relevance. Use embeddings to get your candidate set (top 20), then re-rank to your final context (top 4). That two-step pattern alone tends to close most retrieval quality gaps.
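The wide-then-narrow flow can be sketched generically. This is an illustration, not a library API: `two_stage_retrieve`, the scorer callables, and the toy `overlap` function are all placeholders — in practice stage 1 is your vector store and stage 2 is the cross-encoder above.

```python
from typing import Callable

def two_stage_retrieve(
    query: str,
    corpus: list[str],
    embed_score: Callable[[str, str], float],   # cheap bi-encoder stand-in
    rerank_score: Callable[[str, str], float],  # expensive cross-encoder stand-in
    candidates: int = 20,
    top_k: int = 4,
) -> list[str]:
    # Stage 1: score everything cheaply, keep a wide candidate pool
    pool = sorted(corpus, key=lambda c: embed_score(query, c), reverse=True)[:candidates]
    # Stage 2: score the small pool carefully, keep the final context
    return sorted(pool, key=lambda c: rerank_score(query, c), reverse=True)[:top_k]

# Toy scorer: word overlap stands in for both models in this sketch
def overlap(q: str, c: str) -> float:
    qs, cs = set(q.lower().split()), set(c.lower().split())
    return len(qs & cs) / max(len(qs), 1)

docs = [
    "refund policy for domestic orders",
    "international orders follow a different policy",
    "shipping times vary by region",
]
top = two_stage_retrieve("international orders policy", docs, overlap, overlap,
                         candidates=3, top_k=1)
```

The shape is what matters: the expensive scorer never sees the full corpus, only the candidate pool, which is why the pattern stays cheap at scale.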

Failure 2: You’re Stuffing the Context Window

More chunks = more coverage, right? In practice, more chunks = worse answers.

There’s solid research on this (Liu et al., 2023 — “Lost in the Middle”) showing that LLMs systematically under-attend to content in the middle of long contexts. The first and last items get disproportionate weight. If your most relevant chunk lands at position 6 of 10, you’ve buried the answer.

This is a systems problem: you’re optimizing for recall at the cost of precision, and the model pays the price.

The fix — tight context, deliberate ordering:

def build_context(ranked_chunks: list[str], max_chunks: int = 4) -> str:
    # Top-ranked chunk goes first — LLMs attend to position 0
    selected = ranked_chunks[:max_chunks]
    return "\n\n---\n\n".join(selected)

Four chunks, top chunk first, separator between them. That’s it. Fight the instinct to add more — you’re not scoring on recall, you’re scoring on whether the user gets a correct answer.

Failure 3: Your Prompt Isn’t Actually Grounding the Model

Here’s a frustrating one: you retrieved the right chunk, it’s sitting right there in the context, and the model still hallucinates. How?

Because your system prompt said “use the following context to answer” — and the model treated that as a suggestion, not a constraint. LLMs are trained to be helpful and coherent. When the retrieved context is incomplete or ambiguous, they fill the gap from training weights rather than saying “I don’t know.”

This is a contract enforcement problem. Your prompt needs to be explicit.

The fix — a prompt that enforces grounding:

SYSTEM_PROMPT = """You are a Q&A assistant that answers strictly from provided documents.

Rules:
1. Answer ONLY using information from the CONTEXT block below.
2. Do not use knowledge from your training data.
3. If the answer is not in the context, respond with exactly:
   "This information isn't available in the provided documents."
4. Never infer, extrapolate, or guess beyond what is stated.

CONTEXT:
{context}"""

def build_prompt(query: str, context: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
        {"role": "user", "content": query},
    ]

The explicit failure phrase (Rule 3) is load-bearing. Without a specific string to fall back to, the model will generate a plausible-sounding answer rather than admit the gap. With it, you can detect that string downstream and handle it gracefully in your UI.
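The downstream check can be a few lines. This is a sketch — `handle_answer` and the status labels are illustrative names, not part of any framework:

```python
NO_ANSWER = "This information isn't available in the provided documents."

def handle_answer(raw: str) -> dict:
    # The exact fallback string from the prompt becomes a machine-checkable signal,
    # so the UI can show a "not found" state instead of a confident-looking answer
    if NO_ANSWER in raw:
        return {"status": "no_answer", "text": None}
    return {"status": "ok", "text": raw.strip()}
```

Route `no_answer` to a distinct UI state (and log it — a rising no-answer rate is an early warning that your index is missing content users care about).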

Optional — add a verification pass:

For high-stakes use cases, add a second LLM call that checks the answer against the context:

VERIFY_PROMPT = """Does the following answer contain only information present in the context?
Answer YES or NO, then explain briefly.

Context: {context}
Answer: {answer}"""

def verify_grounding(context: str, answer: str, llm) -> bool:
    response = llm.invoke(VERIFY_PROMPT.format(context=context, answer=answer))
    return response.content.strip().upper().startswith("YES")

It adds latency. Use it where correctness cost is high (legal, medical, financial) and skip it elsewhere.
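One way to wire the verifier in is to fail closed: if the answer doesn't verify, return the same fallback string the prompt uses rather than the unverified answer. A minimal sketch, with the LLM calls abstracted as callables (the function and parameter names here are assumptions):

```python
from typing import Callable

def answer_with_verification(
    query: str,
    context: str,
    generate: Callable[[str, str], str],      # wraps the main LLM call
    is_grounded: Callable[[str, str], bool],  # wraps the verification call
    fallback: str = "This information isn't available in the provided documents.",
) -> str:
    answer = generate(query, context)
    # Fail closed: an unverifiable answer is worse than an honest refusal
    return answer if is_grounded(context, answer) else fallback
```

Failing closed costs you some correct-but-unverifiable answers; for legal, medical, or financial use cases that's usually the right trade.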

Failure 4: Single-Query Retrieval for Multi-Hop Questions

This one’s subtle. Your retrieval works fine for simple factual queries. Then a user asks: “Compare the cancellation policies for Basic and Pro plan subscribers.”

A single embedding query for that sentence will return chunks about cancellation policies in general. It won’t reliably return the specific chunks about Basic and Pro tiers separately. You need two retrievals, not one — but your pipeline doesn’t know that.

The fix — query decomposition:

import json

DECOMPOSE_PROMPT = """Break the following question into simpler sub-questions,
each answerable by a single document lookup.
If the question is already simple, return it as-is.
Return a JSON array of strings. No explanation, just JSON.

Question: {query}"""

def decompose_query(query: str, llm) -> list[str]:
    response = llm.invoke(DECOMPOSE_PROMPT.format(query=query))
    try:
        return json.loads(response.content)
    except json.JSONDecodeError:
        return [query]  # fallback: treat as a simple query

def multi_hop_retrieve(query: str, retriever, llm) -> list:
    # Returns Document objects, deduplicated by content
    sub_queries = decompose_query(query, llm)

    all_chunks = []
    seen = set()

    for sub_q in sub_queries:
        chunks = retriever.get_relevant_documents(sub_q)
        for chunk in chunks:
            if chunk.page_content not in seen:
                seen.add(chunk.page_content)
                all_chunks.append(chunk)

    return all_chunks

For the cancellation policy question, this generates two sub-queries, retrieves independently, deduplicates, and passes a richer context to the LLM. The answer quality difference on multi-hop questions is usually significant.

Failure 5: You Can’t Debug What You Can’t See

This is the one that gets you in production. A user reports a bad answer. You reproduce the query. It returns the right answer now. You shrug and close the ticket.

Three weeks later, you realize 15% of queries on a certain document type have been returning hallucinated answers, and you have no idea when it started or why.

RAG pipelines have multiple moving parts: the query, the retrieved chunks, the scores, the assembled context, the prompt, the response. If you’re not logging all of it, you’re flying blind.

The fix — structured trace logging:

import json
import logging
from datetime import datetime

logger = logging.getLogger("rag")

def traced_query(
    query: str,
    retriever,
    llm,
    session_id: str | None = None,
) -> dict:
    chunks = retriever.get_relevant_documents(query)
    context = build_context([c.page_content for c in chunks])
    prompt = build_prompt(query, context)
    response = llm.invoke(prompt)

    trace = {
        "timestamp": datetime.utcnow().isoformat(),
        "session_id": session_id,
        "query": query,
        "retrieved_chunks": [
            {
                "preview": c.page_content[:150],
                "score": c.metadata.get("relevance_score"),
                "source": c.metadata.get("source"),
            }
            for c in chunks
        ],
        "context_length": len(context),
        "response": response.content,
    }

    logger.info(json.dumps(trace))
    return {"response": response.content, "trace": trace}

Log this to your database, Datadog, or even a flat JSONL file. The point is that when something goes wrong, you can replay the exact retrieval state at the time of the failure.
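For the flat-file option, appending one JSON object per line (JSONL) is enough to make traces greppable and replayable. A minimal sketch using only the standard library — the logger name and file path are illustrative:

```python
import json
import logging
from pathlib import Path

def jsonl_logger(path: str) -> logging.Logger:
    # One JSON object per line: greppable with jq, trivially loadable for replay
    logger = logging.getLogger("rag.trace")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(message)s"))  # message is already JSON
    logger.addHandler(handler)
    return logger

def replay_traces(path: str) -> list[dict]:
    # Reload every trace for offline debugging of a reported failure
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line]
```

When a user reports a bad answer, `replay_traces` plus a filter on the query gives you the exact chunks and scores that produced it — no guessing about what the retriever saw three weeks ago.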

Build an offline eval harness while you’re at it:

def evaluate_pipeline(test_cases: list[dict], retriever, llm) -> dict:
    """
    test_cases: [{"query": str, "expected_chunk_keywords": list[str], "expected_answer": str}]
    """
    retrieval_hits = 0

    for case in test_cases:
        chunks = retriever.get_relevant_documents(case["query"])
        chunk_text = " ".join(c.page_content for c in chunks)

        # Did retrieval find the right content?
        if any(kw.lower() in chunk_text.lower()
               for kw in case["expected_chunk_keywords"]):
            retrieval_hits += 1

    return {
        "retrieval_recall": retrieval_hits / len(test_cases),
        "total_cases": len(test_cases),
    }

50–100 representative queries with known correct answers. Run this every time you change your chunking strategy, your embedding model, or your retrieval parameters. You want to catch regressions before your users do.
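To make that a real regression gate, pin a baseline and fail the run when recall drops below it. A sketch — the test cases, names, and 0.9 baseline here are hypothetical, not from the original pipeline:

```python
# Hypothetical eval cases in the shape evaluate_pipeline expects
TEST_CASES = [
    {"query": "What is the refund window?",
     "expected_chunk_keywords": ["refund", "30 days"]},
    {"query": "Do international orders qualify?",
     "expected_chunk_keywords": ["international"]},
]

def check_regression(results: dict, baseline_recall: float = 0.9) -> None:
    # Fail loudly if recall drops below the last known-good number
    recall = results["retrieval_recall"]
    if recall < baseline_recall:
        raise AssertionError(
            f"Retrieval recall {recall:.2f} fell below baseline {baseline_recall:.2f}"
        )
```

Run it in CI alongside your unit tests; update the baseline deliberately, never silently.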

The Full Picture

Put it all together and your production RAG pipeline looks like this:

1. User query
2. Query decomposition (simple or multi-hop?)
3. Semantic retrieval — top 20 candidates
4. Cross-encoder re-ranking — top 4 kept
5. Context assembly — best chunk first
6. Grounded LLM call — explicit constraints
7. (Optional) Faithfulness verification
8. Response returned + full trace logged

None of these steps are exotic. Every library mentioned is open source. The engineering work is maybe a week of focused effort to retrofit onto an existing pipeline.

The thing that separates a demo from production isn’t intelligence — it’s the boring, careful work of handling edge cases, logging failures, and iterating on a real evaluation harness. Which is, incidentally, exactly what software engineers are for.

Building something interesting with LLMs or agents? I’m always up for swapping war stories in the comments. The bugs are always weirder than they look on paper.


Why Your RAG Pipeline Breaks in Production (And How to Fix It Like an Engineer) was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
