Your RAG Pipeline Is Lying to You. The Problem Is Not the Embeddings

“Retrieval-Augmented Generation is mostly a solved problem.” I heard this at a team review in late 2024. My reaction was: that depends entirely on what you mean by retrieval.

I have spent the better part of two years building RAG pipelines in production — for healthcare record summarisation at my current firm, for NL-to-SQL workflows at my previous firm where the SQL is just the beginning, and for multi-agent research systems on Databricks. In that time I have hit the same ceiling over and over.

The ceiling is not the embeddings. It is the assumption that retrieval should happen exactly once, at a fixed step, using a single source.

Static RAG — embed your docs, retrieve top-k, feed to LLM — works well for simple, self-contained questions against stable internal knowledge. The moment the question becomes multi-hop, time-sensitive, or grounded in both internal docs and live data, the static pipeline starts producing confident-sounding hallucinations. I have seen it prescribe clinical procedures based on outdated internal guidelines when a 30-second web check would have caught the change.

The fix is not better embeddings. It is giving the LLM the ability to decide when and where to retrieve.

TL;DR: Static RAG pipelines fail at multi-hop and time-sensitive questions because retrieval is hardcoded into the architecture. Agentic RAG treats retrieval as a tool the LLM calls on demand — from vector stores, web, or both. You leave with a working LangGraph implementation that handles hybrid doc + web retrieval.

The Problem with “Retrieve Once, Answer Always”

Think of static RAG as a librarian who can only answer you by walking to one shelf — the same shelf, every time, before you finish your sentence.

The question “What are the side effects of metformin?” hits the shelf, finds something, and answers. Fine.

The question “What are the latest FDA-flagged interactions between metformin and the GLP-1 drugs approved after 2023, and does our current patient protocol account for them?” breaks the librarian. It needs to check internal documents and live regulatory data. It needs to reason about a gap between the two. Static RAG cannot do that. It retrieves once, from one source, and the LLM papers over whatever it finds.

This is not a hypothetical. On a healthcare pipeline I worked on, static RAG achieved 89% answer relevance on stable questions but dropped to 61% on questions that required cross-referencing internal clinical docs with external treatment guidelines. The failure was invisible to basic evals because the answers sounded correct — they were just outdated. [1]

The architectural fix is straightforward to describe and non-trivial to implement: make retrieval a decision, not a step.

What Agentic RAG Actually Means

In agentic RAG, the LLM does not retrieve on your behalf at a fixed point in the pipeline. It calls retrieval as a tool — the same way it might call a calculator or a code executor — when it decides retrieval is needed.

This unlocks three things that static pipelines cannot do:

Conditional retrieval. The agent skips retrieval entirely for questions it can answer from context. It calls retrieval only when the evidence it has is insufficient.

Multi-hop retrieval. The agent retrieves, reads, decides it needs more, and retrieves again. Each retrieval is informed by the previous result. This is how humans actually research.

Source routing. The agent chooses the retrieval source — vector store, web search, structured database — based on the query type. “What does our policy say?” hits the vector store. “What did the FDA announce last week?” hits the web.

This is not a new idea. The original ReAct paper [2] from Yao et al. (2022) established the Reasoning + Acting pattern where LLMs interleave reasoning steps with tool calls. Agentic RAG is ReAct applied specifically to retrieval.

What is new is the tooling. LangGraph makes stateful, conditional agent loops practical to build. Tavily provides an agent-native web search API that returns LLM-ready content at production scale — they handle 100M+ monthly requests and maintain a p50 latency of 180ms on search. [3]

The Architecture — Three Moving Parts

The agentic RAG loop has three components. I will name the analogy before the labels.

Think of it as a detective, not a librarian.

A librarian retrieves on command. A detective reads what they have, decides what is missing, goes to get it, reads again, and keeps going until they can form a defensible conclusion.

The three parts of the detective:

1. The router (what kind of retrieval does this query need?) The LLM receives the query and decides: can I answer from existing context? Does this need document retrieval? Does this need web retrieval? Does it need both?

2. The retrieval tools (the evidence sources) At minimum: a vector store tool for internal documents, and a web search tool for live data. Each is a callable function the LLM invokes with a query string.

3. The grounding check (is the evidence enough to answer?) Before generating the final response, the agent checks whether the retrieved evidence is sufficient and relevant. If not, it retrieves again. This is the loop that static pipelines cannot express.

What This Actually Looks Like in Practice

Here is a working LangGraph implementation of a hybrid agentic RAG agent. It routes between a Qdrant vector store and Tavily web search. The agent decides which tool to call, can call both in sequence, and will re-retrieve if the first result does not ground the answer.

from typing import Annotated, Literal
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from tavily import TavilyClient
from qdrant_client import QdrantClient
import operator
from typing import TypedDict, Sequence

# --- State definition ---
class AgentState(TypedDict):
    messages: Annotated[Sequence, operator.add]
    retrieval_count: int  # guard against infinite loops

# --- Tool 1: Vector store retrieval (internal docs) ---
qdrant_client = QdrantClient(url="http://localhost:6333")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name="internal_docs",
    embedding=embeddings,
)
@tool
def retrieve_from_docs(query: str) -> str:
    """
    Search internal document knowledge base.
    Use for questions about internal policies, protocols, or proprietary knowledge.
    """
    results = vector_store.similarity_search(query, k=4)
    if not results:
        return "No relevant internal documents found for this query."
    return "\n\n---\n\n".join([
        f"[Source: {doc.metadata.get('source', 'internal')}]\n{doc.page_content}"
        for doc in results
    ])

# --- Tool 2: Web retrieval (live data) ---
tavily = TavilyClient(api_key="YOUR_TAVILY_API_KEY")
@tool
def retrieve_from_web(query: str) -> str:
    """
    Search the live web for current information.
    Use for recent news, regulatory updates, or anything that may have changed
    since the internal knowledge base was last updated.
    """
    results = tavily.search(
        query=query,
        max_results=5,
        search_depth="advanced",  # deep retrieval for better accuracy
    )
    if not results.get("results"):
        return "No relevant web results found."
    return "\n\n---\n\n".join([
        f"[Source: {r['url']}]\n{r['content']}"
        for r in results["results"]
    ])

# --- Agent node ---
tools = [retrieve_from_docs, retrieve_from_web]
tool_node = ToolNode(tools)
llm = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)
SYSTEM_PROMPT = """You are a research assistant with access to two retrieval tools:
- retrieve_from_docs: internal document knowledge base (policies, protocols, proprietary data)
- retrieve_from_web: live web search (recent news, regulatory updates, real-time information)
For any question:
1. Decide whether you need to retrieve information or can answer from context.
2. If retrieval is needed, choose the right source - or call both if the question spans internal and live knowledge.
3. After retrieving, check if the evidence is sufficient. If not, retrieve again with a refined query.
4. Always cite your sources in the final answer.
Do not hallucinate. If you cannot find sufficient evidence, say so explicitly."""
def call_agent(state: AgentState) -> AgentState:
    messages = [SystemMessage(content=SYSTEM_PROMPT)] + list(state["messages"])
    response = llm.invoke(messages)
    return {
        "messages": [response],
        "retrieval_count": state.get("retrieval_count", 0)
    }
def update_retrieval_count(state: AgentState) -> AgentState:
    return {
        "messages": [],
        "retrieval_count": state.get("retrieval_count", 0) + 1
    }

# --- Routing logic ---
MAX_RETRIEVALS = 3  # prevent infinite loops
def should_continue(state: AgentState) -> Literal["tools", "end"]:
    last_message = state["messages"][-1]
    # If the LLM called a tool and we haven't hit the loop guard, continue
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        if state.get("retrieval_count", 0) < MAX_RETRIEVALS:
            return "tools"
    return "end"

# --- Build the graph ---
graph_builder = StateGraph(AgentState)
graph_builder.add_node("agent", call_agent)
graph_builder.add_node("tools", tool_node)
graph_builder.add_node("count_retrieval", update_retrieval_count)
graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": END}
)
graph_builder.add_edge("tools", "count_retrieval")
graph_builder.add_edge("count_retrieval", "agent")
agent = graph_builder.compile()

# --- Run it ---
if __name__ == "__main__":
    result = agent.invoke({
        "messages": [
            HumanMessage(content=(
                "What does our current patient protocol say about GLP-1 usage, "
                "and are there any FDA safety updates issued in 2025 we should be aware of?"
            ))
        ],
        "retrieval_count": 0,
    })
    print(result["messages"][-1].content)

A few things worth noting in this implementation.

The MAX_RETRIEVALS guard at line 3 of the routing logic is not cosmetic. Without it, I have watched agents spiral into 15 retrieval calls on a single ambiguous question, burning both tokens and patience. Three retrievals is usually enough. If it is not, the question is either too broad or the retrieval tools are returning garbage.

The retrieval_count field in the state is the mechanism that makes the loop breakable. LangGraph's StateGraph carries this across every node transition — it persists through the entire conversation turn, not just one cycle.

The system prompt is doing real work here. Telling the LLM when to use each tool, not just that tools exist, is what separates an agent that routes correctly from one that calls retrieve_from_web for every query out of laziness.

The Part Nobody Talks About: When Agentic RAG Makes Things Worse

This pattern does not replace static RAG. It replaces static RAG for specific problem shapes.

It makes things worse when:

Latency is the primary constraint. Every retrieval loop adds at least one LLM call and one tool call. For a simple Q&A assistant where p95 latency must stay under 1 second, an agentic loop that may run 2–3 cycles is a non-starter. Static RAG with a well-tuned retriever is faster and sufficient.

The knowledge base is stable and fully self-contained. If every question your users ask can be answered from your vector store with high confidence, you are adding complexity for no benefit. Run RAGAS metrics on your static pipeline first. If context recall is above 85% and answer faithfulness is above 90%, the agentic wrapper is overhead.

You cannot trust the LLM’s tool selection. In my experience, GPT-4o and Claude Sonnet route correctly between doc and web retrieval about 88% of the time with a well-written system prompt. The remaining 12% introduces retrieval-source errors that can be harder to debug than simple hallucinations, because the answer is grounded — just in the wrong source. If you are using a weaker model, instrument tool selection before shipping.

Real-time data is untrusted. Web retrieval adds a live attack surface. Tavily has prompt injection filtering built into the API [3], but no filter is perfect. If your agent is taking actions based on web-retrieved content, that content needs to be treated as potentially adversarial.

Use agentic RAG when: the knowledge spans multiple sources, some of which are dynamic. Start with static RAG and upgrade when RAGAS scores reveal a specific failure mode this architecture solves.

Where to Start — Three Entry Points

Smallest step (under 5 minutes): Add a single Tavily tool call to an existing LangChain agent using from langchain_community.tools.tavily_search import TavilySearchResults. You do not need LangGraph yet. Just expose the tool, write a system prompt that tells the LLM when to call it, and run five test queries. Watch what it chooses.

Intermediate (one afternoon): Build the two-tool LangGraph pattern from the code above against your own vector store. Skip the retrieval count guard initially to observe how often the agent loops. Then add the guard. The difference is informative.

Advanced: Add a grading node between the tool output and the next agent call. The grader is a second, cheaper LLM call that scores whether the retrieved context is sufficient (yes/no) before the main LLM synthesises an answer. This is the step that brings context recall from “good enough in testing” to “reliable in production.”

The 12% tool-routing error rate I mentioned is where most production debugging time goes. Instrument it early. The grading node catches it before users do.

Want to Go Deeper?

I’ve compiled a comprehensive GenAI Interview Prep Guide — 80+ questions with in-depth answers, architecture diagrams, and a 2-week study plan — as a downloadable resource.

👉 Get the Full GenAI Interview Prep Pack on Gumroad
👉 RAG interview questions

References

[1] Based on internal eval runs across a healthcare RAG pipeline using RAGAS metrics; not a published study. Answer relevance and context recall measured using RAGAS 0.1.x against a curated test set of 200 clinical queries.

[2] Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv preprint arXiv:2210.03629. https://arxiv.org/abs/2210.03629

[3] Tavily (2026). Tavily 101: AI-powered Search for Developers. Tavily Blog. https://www.tavily.com/blog/tavily-101-ai-powered-search-for-developers — latency figure (180ms p50) and request volume (100M+ monthly) cited directly from Tavily’s product page as of May 2026.

[4] LangGraph Documentation. Tool Node. LangChain. https://langchain-ai.github.io/langgraph/reference/prebuilt/#langgraph.prebuilt.tool_node.ToolNode

[5] Qdrant Documentation. Langchain Integration. https://qdrant.tech/documentation/frameworks/langchain/

[6] Es, S. et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217. https://arxiv.org/abs/2309.15217

Your RAG Pipeline Is Lying to You. The Problem Is Not the Embeddings — Agentic RAG was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.