
For years, my world was the internals of deep learning: attention mechanisms, transformer architectures, loss functions, gradient flow. My hands were always in data pipelines, model training loops, and benchmark evaluations. The kind of work where you know exactly what every tensor is doing at every step.
But the industry moved. Quietly at first, then all at once. The conversation stopped being about training better models and started being about connecting powerful models to real-world systems. Open-source models like LLaMA, Mistral, and Qwen became genuinely capable. The bottleneck shifted from “make the model smarter” to “make the model actually useful in production.”
The question is no longer “can the model reason?” It’s “can the model reason about your data?”
That’s the world of AI agents, RAG, and MCP. This post is everything I now understand, explained simply and built up carefully.
A Real Problem First
Before any theory, a concrete example.
Hospital ICU Monitoring
An ICU has 40 patients. Each patient has vitals streaming every 5 seconds: heart rate, oxygen saturation, blood pressure, temperature. Doctors can’t watch 40 dashboards simultaneously. Critical deteriorations get missed.
A nurse asks: “Any patients showing early signs of sepsis in the last 6 hours?”
The agent queries the vitals database for rising temperature + falling blood pressure + elevated heart rate. Then searches clinical guidelines for early sepsis indicators. Responds: “Patient 14 and Patient 31 show 3 of 4 SIRS criteria since 2am. Recommend immediate assessment.”
No dashboard. No manual scanning. Natural language, grounded in real data and real medical knowledge.
That’s the pattern:
- Live structured data → MCP tools
- Medical guidelines → RAG
- Reasoning across both → AI agent
Let’s understand each piece one by one.
What is an AI Agent?
A regular LLM answers one question, once, and stops. You ask, it responds. End of transaction.
An AI agent is different. It’s an LLM running in a loop: given a goal, it decides what to do, takes an action, observes the result, and decides what to do next. It keeps going until the goal is achieved. No human guidance between steps.
Example loop:
Goal: “Why did my cloud bill spike this month?”
- Think: I need to see this month’s billing data first
- Act: call get_billing_summary tool → pulls from billing DB
- Observe: total spend jumped 340% compared to last month
- Think: need to know which service caused it
- Act: call get_cost_by_service tool → breaks down spend per service
- Observe: EC2 compute costs account for 89% of the spike
- Think: need to know which specific instances or jobs ran
- Act: call get_ec2_usage_logs tool → queries usage database
- Observe: 47 large GPU instances spun up on the 14th and ran for 6 days
- Think: need to know who triggered them
- Act: call get_deployment_events tool → checks deployment history DB
- Observe: a model training job was triggered by an automated pipeline on the 14th with no cost cap set
- Answer: “Your bill spiked because an automated training pipeline launched 47 GPU instances on the 14th with no spending limit. They ran for 6 days costing $12,400. The job was triggered by a scheduled CI pipeline. Recommend setting a cost cap and adding an approval step for GPU jobs above a threshold.”
No human said “now check EC2.” No human said “now find who triggered it.” The agent reasoned its way through autonomously.
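To make the loop concrete, here is a minimal sketch of a think-act-observe loop in Python. The `llm` client, its `decide` method, and the tool registry are hypothetical placeholders, not any particular framework’s API.

```python
# Minimal think-act-observe loop. `llm` and `tools` are hypothetical stand-ins:
# llm.decide(history) is assumed to return either a final answer or a tool call.
def run_agent(goal: str, llm, tools: dict, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm.decide(history)                    # Think: what should I do next?
        if decision.is_final:
            return decision.answer                        # Goal achieved, stop the loop
        result = tools[decision.tool](**decision.args)    # Act: call the chosen tool
        history.append(f"{decision.tool} -> {result}")    # Observe: remember the result
    return "Stopped: step limit reached without an answer."
```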
LangGraph: Structuring the Loop
Raw agent loops are unpredictable. LangGraph orchestrates them by making the workflow an explicit graph.
- Nodes are each step the agent takes (e.g., classify intent, call an MCP tool, search RAG, synthesize the answer)
- Edges define what comes next based on the result
- Conditional routing: if a tool fails, route to a fallback node
- State persists throughout, so the agent remembers what it already learned
- Debuggable: you can see exactly which node failed, not a black-box loop going wrong somewhere
If an MCP tool returns an error, the agent can retry with different parameters or tell the user clearly what went wrong. The whole flow is transparent.
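Here is a minimal sketch of the billing example as a LangGraph graph. The state fields and node bodies are simplified placeholders; a real graph would add conditional edges for retries and fallbacks.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# State persists across nodes, so later steps can use what earlier steps learned
class AgentState(TypedDict):
    question: str
    billing_data: str
    answer: str

def fetch_billing(state: AgentState) -> dict:
    # Placeholder for an MCP tool call against the billing DB
    return {"billing_data": "EC2 spend up 340% since the 14th"}

def synthesize(state: AgentState) -> dict:
    return {"answer": f"Likely cause, based on: {state['billing_data']}"}

graph = StateGraph(AgentState)
graph.add_node("fetch_billing", fetch_billing)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("fetch_billing")
graph.add_edge("fetch_billing", "synthesize")
graph.add_edge("synthesize", END)

app = graph.compile()
print(app.invoke({"question": "Why did my cloud bill spike this month?"}))
```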
MCP: The Universal Connector
Imagine you’re building a customer support chatbot. It needs to read from your database, search your Google Drive, and pull Slack messages. You write custom code to connect each one. Next month you switch from GPT to Claude. You rewrite all three connectors. New LLM next year, rewrite again.
That’s the N×M problem. N LLMs × M data sources = N×M custom integrations. Every team rebuilding the same thing differently.
Before MCP:
- Chatbot → custom DB connector
- Chatbot → custom Google Drive connector
- Chatbot → custom Slack connector
- Switch LLM → rewrite all three
After MCP:
- DB → one MCP server, built once, never changes
- Google Drive → one MCP server, built once
- Slack → one MCP server, built once
- Any LLM that speaks MCP connects to all three automatically
Think of it as USB-C for AI. Before USB-C, every device had a different charger. With USB-C, one port, everything works. MCP is that one port for LLMs.
How it works
MCP has three roles.
- The Host is the application the user interacts with (Claude Desktop, your chatbot)
- The Client lives inside the host and manages the connection
- The Server is your code that exposes tools
MCP servers can expose three things.
- Tools are functions the LLM can call
- Resources are data the LLM can read as context, like a DB schema
- Prompts are reusable templates the user can invoke
The LLM reads each tool’s name and docstring at startup. When a user asks “what was the bill last month?” the LLM matches the question semantically to get_billing_summary and calls it with the right parameters.
The docstring is not just documentation, it’s the interface between the LLM and your tool. Writing it well is prompt engineering inside your code.
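As a sketch, here is roughly what a tool looks like with the FastMCP helper from the Python MCP SDK. The billing logic and return values are hypothetical; the docstring is what the LLM actually reads when deciding whether to call the tool.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("billing")

@mcp.tool()
def get_billing_summary(month: str) -> dict:
    """Return total cloud spend for a given month (format YYYY-MM).

    Use this when the user asks about overall costs or a bill spike.
    """
    # Placeholder: query your real billing database here
    return {"month": month, "total_usd": 12400}

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport for local development
```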
Transport options
stdio is standard input/output, used for local development. SSE (Server-Sent Events) is used for remote production deployments where the MCP server is hosted separately.
MCP vs REST API
REST is a fixed endpoint. You call it, it returns, done. You hardcode what to fetch. MCP lets the LLM itself discover and decide which tool to call at runtime, based on the question. MCP is also stateful within a session: the agent can call multiple tools in sequence and maintain context across them.
RAG: Teaching AI Your Documents
LLMs are trained on the internet. They don’t know your company’s maintenance manuals, your internal SOPs, your proprietary procedures. RAG (Retrieval-Augmented Generation) solves this without retraining the model.
The idea: at query time, retrieve relevant documents and inject them into the LLM’s context. The model reasons over your documents, not just its training data.
The full pipeline
Indexing time:
- Upload PDF or manual
- Chunk into smaller pieces semantically
- Add metadata to each chunk (source, version, section, etc)
- Generate vector embeddings
- Store in vector DB
Query time:
- User asks a question
- Embed the question using the same model
- Filter by metadata to narrow the search space
- Similarity search in the vector DB
- Rerank results for true relevance
- Inject top chunks into LLM context
- LLM answers with citations
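A compressed sketch of both halves, using Chroma as the vector DB purely for illustration; the document text, metadata, and question are made up, and any vector store with metadata filtering works the same way.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("manuals")

# Indexing time: store chunks with metadata (Chroma embeds them with its default model)
collection.add(
    ids=["cost-gov-3.2"],
    documents=["GPU jobs above $1,000 require manual approval before launch..."],
    metadatas=[{"source": "cost-governance.pdf", "section": "3.2"}],
)

# Query time: embed the question, filter by metadata, then run similarity search
results = collection.query(
    query_texts=["How do I require approval for large GPU jobs?"],
    n_results=3,
    where={"source": "cost-governance.pdf"},
)
print(results["documents"][0])
```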
Vector embeddings
Text is converted into a list of hundreds of numbers, where similar meanings end up close together in that numerical space.
“Cloud bill spiked” and “AWS costs increased” end up near each other. “Deployment pipeline config” ends up far away.
That’s semantic search, searching by meaning, not exact keywords.
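You can see this directly with an off-the-shelf embedding model. The sketch below assumes the sentence-transformers library and its all-MiniLM-L6-v2 model; any embedding model behaves similarly.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Cloud bill spiked", "AWS costs increased", "Deployment pipeline config"]
vectors = model.encode(texts)

# Similar meaning scores higher, despite no shared keywords
print(util.cos_sim(vectors[0], vectors[1]))  # "Cloud bill spiked" vs "AWS costs increased": high
print(util.cos_sim(vectors[0], vectors[2]))  # vs "Deployment pipeline config": much lower
```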
Chunking
Every RAG system lives or dies by how it splits documents. Too small and you lose context. Too large and retrieval becomes imprecise.
- Fixed chunking splits every 500 tokens regardless of content. Mid-sentence cuts happen. Procedures split across chunks. Description in chunk 1, resolution steps in chunk 2. Retrieval finds half an answer.
- Semantic chunking splits where meaning changes.
- Each procedure can be one chunk. Complete, self-contained units.
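One rough way to approximate semantic chunking is to start a new chunk wherever consecutive sentences stop being similar. A minimal sketch, again assuming sentence-transformers; the 0.5 threshold is an assumption you would tune per corpus.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Start a new chunk where the meaning shifts
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```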
Re-ranking: Why similarity is not relevance
Vector search finds the most similar text. But similar isn’t always the most relevant answer. A chunk describing what triggers a billing spike may rank higher than the chunk explaining how to set a cost cap, even though the user asked how to fix it.
Re-ranking uses a cross-encoder model that sees the query and chunk together. It doesn’t ask “are these similar?” It asks “does this chunk answer this question?” That distinction is everything.
Two stages:
- Stage 1, vector search: fast and broad, retrieves the top 20 candidates
- Stage 2, re-ranker: slow and precise, scores each for true relevance
Bi-encoder (vector search):
- Query → [embed] → vector A
- Chunk → [embed] → vector B
- Score = cosine similarity(A, B)
- Less accurate → they never “see” each other
Cross-encoder (re-ranker):
- [Query + Chunk together] → model → relevance score 0.0 to 1.0
- Full attention between query and chunk
Only the top 3 to 5 chunks go to the LLM. You can’t rerank 10,000 chunks; vector search narrows the field fast, and the reranker picks the best from that shortlist.
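A minimal sketch of that second stage with a cross-encoder from sentence-transformers; the query, candidate chunks, and model name are illustrative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I set a cost cap?"
candidates = [
    "A billing spike is usually triggered by uncapped GPU training jobs.",
    "To set a cost cap, open budget settings and define a monthly spending limit.",
]

# The cross-encoder sees query and chunk together and scores true relevance
scores = reranker.predict([(query, chunk) for chunk in candidates])
best_first = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(best_first[0][0])  # the "how to fix it" chunk should win
```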
What can go wrong
- If nothing in the vector DB answers the question, RAG still returns the closest chunks. The LLM then hallucinates from irrelevant context. Fix this with confidence thresholds: if the best chunk similarity is below 0.6, don’t retrieve at all. Add strict prompt instructions: “If the context does not contain the answer, say I don’t know.” Never guess.
- If the answer spans multiple chunks, retrieval might only find part of it. Fix this by retrieving more candidates, using parent document retrieval (chunk small for search, but return the full parent section on a match), and combining vector search with a knowledge graph.
- Documents go stale. When the manual is updated, only re-process the changed sections. Store a hash of each section, compare on new uploads, re-index only what changed. Fast, targeted, keeps answers current.
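The staleness fix in the last bullet is only a few lines. A minimal sketch, assuming documents are already split into named sections and the previous run’s hashes are persisted somewhere:

```python
import hashlib

def section_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_sections(new_sections: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    # Re-index only the sections whose content hash changed since the last upload
    return [
        section_id
        for section_id, text in new_sections.items()
        if stored_hashes.get(section_id) != section_hash(text)
    ]
```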
One way to fix the hardest RAG failures is a knowledge graph. Here’s why.
Knowledge Graph: When RAG Isn’t Enough
RAG finds text that’s similar to your question. But it doesn’t understand how things connect.
Ask “what costs are related to GPU usage?” and RAG returns chunks that mention both words. A knowledge graph traverses the actual relationship: GPU job → triggers → EC2 instances → drives up → compute costs → impacts → monthly budget KPI.
The three building blocks:
- Node: An entity (Training Job, EC2 Instance, Monthly Budget)
- Edge: A relationship between entities (TRIGGERS, DRIVES_UP, IMPACTS)
- Property: Attributes on nodes or edges
Every fact is stored as Subject → Relationship → Object.
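A tiny sketch of that structure using networkx as a stand-in graph store (production systems typically use a dedicated graph database such as Neo4j); the entities and relations come from the billing example.

```python
import networkx as nx

g = nx.DiGraph()

# Each fact is one Subject -> Relationship -> Object edge
g.add_edge("Training Job", "EC2 Instances", relation="TRIGGERS")
g.add_edge("EC2 Instances", "Compute Costs", relation="DRIVES_UP")
g.add_edge("Compute Costs", "Monthly Budget KPI", relation="IMPACTS")

# Traversal answers "what does a training job ultimately affect?"
print(list(nx.descendants(g, "Training Job")))
# ['EC2 Instances', 'Compute Costs', 'Monthly Budget KPI'] (order may vary)
```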
GraphRAG is Microsoft’s approach that combines both. Vector search finds similar content. Graph traversal understands how things connect. Together they dramatically reduce hallucinations on relationship-heavy questions.
Worth knowing: Knowledge graphs are powerful but expensive to build and maintain. For simple Q&A over documents, RAG alone is often enough. Add a knowledge graph when questions involve relationships, multi-hop reasoning, or a well-defined domain where the connections between things matter as much as the content.
Putting It All Together
This is where it gets interesting. The real power isn’t any one of these tools, it’s combining them.
Single chatbot, three sources:
User question
↓
Agent (LangGraph) decides what's needed
↓
MCP tools → live data from database
RAG search → relevant document chunks
Graph query → relationship context
↓
All three injected into single LLM context
↓
One unified answer with citations
Concrete example:
“Why is my bill so high and how do I stop it happening again?”
- MCP fetches → live cost breakdown, usage logs, which pipeline triggered it
- RAG retrieves → cost governance documentation, budget cap configuration guide
- LLM synthesizes → “Your bill spiked due to an uncapped training job. According to your cost governance policy section 3.2, GPU jobs above $1,000 require manual approval. Here’s how to configure that…”
Neither source alone gives that answer. That’s the point.
How to Prevent Hallucination
- Ground the LLM: the system prompt says to answer only from the provided context, not from general knowledge
- Structured data from MCP: Numbers don’t hallucinate, LLM just reads and reports them
- Citations from RAG: LLM tells the user exactly which document section it used
- Temperature near zero: For factual queries, less creativity, more precision
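The grounding and temperature points reduce to a prompt and one parameter. A minimal sketch with the OpenAI Python client as a stand-in; any chat-completion API works the same way, and the model name is just an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grounded_answer(question: str, context: str, source: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        temperature=0,         # factual queries: less creativity, more precision
        messages=[
            {"role": "system", "content": (
                "Answer only from the provided context, not from general knowledge. "
                "If the context does not contain the answer, say 'I don't know'. "
                f"Cite the source as ({source})."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```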
The Details That Matter
Getting Retrieval Right
- Embeddings: Don’t embed one document at a time. Batch and async. Cache embeddings so identical text never re-embeds. The query and index must use the exact same model. Different models produce incompatible vector spaces and the results are silently garbage.
- Vector DB: Chroma and FAISS are fine for prototypes. For production use Pinecone, Milvus, or pgvector. pgvector sits inside PostgreSQL, so if you’re already running Postgres you get vector search for free with no extra infrastructure and metadata filtering using standard SQL.
- Retrieval: Don’t just top-k. Use hybrid search combining vector and keyword. Always filter by metadata first to narrow the search space before similarity search. Rerank before injecting into the LLM.
Scaling It Up
- Scale: A single API endpoint breaks under load. Use async and workers. Autoscaling and load balancing for production traffic. Multi-layer caching: cache query results, cache embeddings, cache recent responses.
- Multi-tenancy: Never share vector spaces across tenants. Each factory, each customer, each team gets its own isolated chunk space. Metadata must include a tenant ID. Filter by tenant before any similarity search.
- Context: Don’t dump all retrieved chunks into the LLM. Compress and be token-aware. The LLM has a limited context window. Prioritize ruthlessly. Include citations so the LLM can tell the user where each piece of information came from.
Keeping It Honest
- Queries: Raw user queries are often bad search queries. Rewrite them before searching. “Bill went up again” is a terrible search query. “AWS EC2 cost spike causes and prevention” is much better. An LLM can rewrite queries automatically before retrieval (see the sketch after this list).
- Monitoring: Log every retrieval. What was fetched, what was ignored, what score it got. Track retrieval quality separately from answer quality. A great LLM can’t save bad retrieval. Trace full agent runs, which tools were called, in what order, what each returned.
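A minimal sketch of that rewrite step, again using the OpenAI client as a stand-in for whatever LLM you run:

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(raw_query: str) -> str:
    # Turn a vague user message into a retrieval-friendly search query
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's message as a concise, keyword-rich search query. "
                "Return only the rewritten query."
            )},
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content

print(rewrite_query("Bill went up again"))  # e.g. "AWS EC2 cost spike causes and prevention"
```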
MCP vs RAG: When to Use Which
Use RAG when:
- Answering from documents (manuals, SOPs, compliance guides, wikis)
- The knowledge is relatively static
- The question is “what does X mean” or “how does X work”
- The data is unstructured text
Use MCP when:
- Answering from live data (sensor readings, KPIs, database records)
- The data changes frequently
- The question is “what is happening now” or “what happened yesterday”
- The data is structured and numerical
Use both when:
- Question requires reasoning across live data & documentation together
- “Why is my bill so high and how do I stop it from happening again?” needs live cost breakdown and usage logs from MCP tools and the cost governance documentation and budget cap configuration guide from RAG. Neither source alone gives the complete answer.
Can you use RAG instead of MCP?
Technically, yes. But it’s the wrong tool. Data could be updating every 10 seconds; you’d need constant re-indexing. Vector search on numbers is inaccurate. “What was production at exactly 2:47pm?” is not a semantic question. It’s a lookup. Use MCP.
Can you skip MCP and just query the database directly?
Yes. But MCP adds reusability: any LLM can call it without changes. It gives the LLM agency to decide what to fetch based on the question, rather than your backend hardcoding what to pull. It adds clean separation: the chatbot doesn’t need to know SQL or table names. And it adds a standardized interface that any future LLM can plug into.
Where This All Goes
The models are no longer the bottleneck; open-source LLMs are genuinely capable and accessible. The real challenge now is connecting them to your world: your data, your documents, your systems. Agents, MCP, and RAG are the answer to that challenge. You don’t need to understand every nuance before you start: pick a real problem, build the smallest version of it, and the concepts will click faster than any amount of reading.