
For years, my world was the internals of deep learning: attention mechanisms, transformer architectures, loss functions, gradient flow. My hands were always in data pipelines, model training loops, and benchmark evaluations. The kind of work where you know exactly what every tensor is doing at every step.
But the industry moved. Quietly at first, then all at once. The conversation stopped being about training better models and started being about connecting powerful models to real-world systems. Open-source models like LLaMA, Mistral, and Qwen became genuinely capable. The bottleneck shifted from “make the model smarter” to “make the model actually useful in production.”
The question is no longer “can the model reason?” It’s “can the model reason about your data?”
That’s the world of AI agents, RAG, and MCP. This post is everything I now understand, explained simply and built up carefully.
A Real Problem First
Before any theory, a concrete example.
Hospital ICU Monitoring
An ICU has 40 patients. Each patient has vitals streaming every 5 seconds: heart rate, oxygen saturation, blood pressure, temperature. Doctors can’t watch 40 dashboards simultaneously. Critical deteriorations get missed.
A nurse asks: “Any patients showing early signs of sepsis in the last 6 hours?”
The agent queries the vitals database for rising temperature + falling blood pressure + elevated heart rate. Then searches clinical guidelines for early sepsis indicators. Responds: “Patient 14 and Patient 31 show 3 of 4 SIRS criteria since 2am. Recommend immediate assessment.”
No dashboard. No manual scanning. Natural language, grounded in real data and real medical knowledge.
That’s the pattern:
- Live structured data → MCP tools
- Medical guidelines → RAG
- Reasoning across both → AI agent
Let’s understand each piece one by one.
What is an AI Agent?
A regular LLM answers one question, once, and stops. You ask, it responds. End of transaction.
An AI agent is different. It’s an LLM running in a loop: given a goal, it decides what to do, takes an action, observes the result, and decides what to do next. It keeps going until the goal is achieved. No human guidance between steps.
Example loop:
Goal: “Why did my cloud bill spike this month?”
- Think: I need to see this month’s billing data first
- Act: call get_billing_summary tool → pulls from billing DB
- Observe: total spend jumped 340% compared to last month
- Think: need to know which service caused it
- Act: call get_cost_by_service tool → breaks down spend per service
- Observe: EC2 compute costs account for 89% of the spike
- Think: need to know which specific instances or jobs ran
- Act: call get_ec2_usage_logs tool → queries usage database
- Observe: 47 large GPU instances spun up on the 14th and ran for 6 days
- Think: need to know who triggered them
- Act: call get_deployment_events tool → checks deployment history DB
- Observe: a model training job was triggered by an automated pipeline on the 14th with no cost cap set
- Answer: “Your bill spiked because an automated training pipeline launched 47 GPU instances on the 14th with no spending limit. They ran for 6 days costing $12,400. The job was triggered by a scheduled CI pipeline. Recommend setting a cost cap and adding an approval step for GPU jobs above a threshold.”
No human said “now check EC2.” No human said “now find who triggered it.” The agent reasoned its way through autonomously.
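To make the loop concrete, here is a minimal sketch of a think-act-observe loop in Python. The `llm` client, its `decide` method, and the tool registry are hypothetical placeholders, not any particular framework’s API.

```python
# Minimal think-act-observe loop. `llm` and `tools` are hypothetical stand-ins:
# llm.decide(history) is assumed to return either a final answer or a tool call.
def run_agent(goal: str, llm, tools: dict, max_steps: int = 10) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        decision = llm.decide(history)                    # Think: what should I do next?
        if decision.is_final:
            return decision.answer                        # Goal achieved, stop the loop
        result = tools[decision.tool](**decision.args)    # Act: call the chosen tool
        history.append(f"{decision.tool} -> {result}")    # Observe: remember the result
    return "Stopped: step limit reached without an answer."
```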
LangGraph: Structuring the Loop
Raw agent loops are unpredictable. LangGraph orchestrates them by making the workflow an explicit graph.
- Nodes are each step the agent takes (e.g., classify intent, call an MCP tool, search RAG, synthesize the answer)
- Edges define what comes next based on the result
- Conditional routing: if a tool fails, route to a fallback node
- State persists throughout, so the agent remembers what it already learned
- Debuggable: you can see exactly which node failed, not a black-box loop going wrong somewhere
If an MCP tool returns an error, the agent can retry with different parameters or tell the user clearly what went wrong. The whole flow is transparent.
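Here is a minimal sketch of the billing example as a LangGraph graph. The state fields and node bodies are simplified placeholders; a real graph would add conditional edges for retries and fallbacks.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

# State persists across nodes, so later steps can use what earlier steps learned
class AgentState(TypedDict):
    question: str
    billing_data: str
    answer: str

def fetch_billing(state: AgentState) -> dict:
    # Placeholder for an MCP tool call against the billing DB
    return {"billing_data": "EC2 spend up 340% since the 14th"}

def synthesize(state: AgentState) -> dict:
    return {"answer": f"Likely cause, based on: {state['billing_data']}"}

graph = StateGraph(AgentState)
graph.add_node("fetch_billing", fetch_billing)
graph.add_node("synthesize", synthesize)
graph.set_entry_point("fetch_billing")
graph.add_edge("fetch_billing", "synthesize")
graph.add_edge("synthesize", END)

app = graph.compile()
print(app.invoke({"question": "Why did my cloud bill spike this month?"}))
```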
MCP: The Universal Connector
Imagine you’re building a customer support chatbot. It needs to read from your database, search your Google Drive, and pull Slack messages. You write custom code to connect each one. Next month you switch from GPT to Claude. You rewrite all three connectors. New LLM next year, rewrite again.
That’s the N×M problem. N LLMs × M data sources = N×M custom integrations. Every team rebuilding the same thing differently.
Before MCP:
- Chatbot → custom DB connector
- Chatbot → custom Google Drive connector
- Chatbot → custom Slack connector
- Switch LLM → rewrite all three
After MCP:
- DB → one MCP server, built once, never changes
- Google Drive → one MCP server, built once
- Slack → one MCP server, built once
- Any LLM that speaks MCP connects to all three automatically
Think of it as USB-C for AI. Before USB-C, every device had a different charger. With USB-C, one port, everything works. MCP is that one port for LLMs.
How it works
MCP has three roles.
- The Host is the application the user interacts with (Claude Desktop, your chatbot)
- The Client lives inside the host and manages the connection
- The Server is your code that exposes tools
MCP servers can expose three things.
- Tools are functions the LLM can call
- Resources are data the LLM can read as context, like a DB schema
- Prompts are reusable templates the user can invoke
The LLM reads each tool’s name and docstring at startup. When a user asks “what was the bill last month?” the LLM matches the question semantically to get_billing_summary and calls it with the right parameters.
The docstring is not just documentation, it’s the interface between the LLM and your tool. Writing it well is prompt engineering inside your code.
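As a sketch, here is roughly what a tool looks like with the FastMCP helper from the Python MCP SDK. The billing logic and return values are hypothetical; the docstring is what the LLM actually reads when deciding whether to call the tool.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("billing")

@mcp.tool()
def get_billing_summary(month: str) -> dict:
    """Return total cloud spend for a given month (format YYYY-MM).

    Use this when the user asks about overall costs or a bill spike.
    """
    # Placeholder: query your real billing database here
    return {"month": month, "total_usd": 12400}

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport for local development
```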
Transport options
stdio is standard input/output, used for local development. SSE (Server-Sent Events) is used for remote production deployments where the MCP server is hosted separately.
MCP vs REST API
REST is a fixed endpoint. You call it, it returns, done. You hardcode what to fetch. MCP lets the LLM itself discover and decide which tool to call at runtime, based on the question. MCP is also stateful within a session: the agent can call multiple tools in sequence and maintain context across them.
RAG: Teaching AI Your Documents
LLMs are trained on the internet. They don’t know your company’s maintenance manuals, your internal SOPs, your proprietary procedures. RAG (Retrieval-Augmented Generation) solves this without retraining the model.
The idea: at query time, retrieve relevant documents and inject them into the LLM’s context. The model reasons over your documents, not just its training data.
The full pipeline
Indexing time:
- Upload PDF or manual
- Chunk into smaller pieces semantically
- Add metadata to each chunk (source, version, section, etc)
- Generate vector embeddings
- Store in vector DB
Query time:
- User asks a question
- Embed the question using the same model
- Filter by metadata to narrow the search space
- Similarity search in the vector DB
- Rerank results for true relevance
- Inject top chunks into LLM context
- LLM answers with citations
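A compressed sketch of both halves, using Chroma as the vector DB purely for illustration; the document text, metadata, and question are made up, and any vector store with metadata filtering works the same way.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("manuals")

# Indexing time: store chunks with metadata (Chroma embeds them with its default model)
collection.add(
    ids=["cost-gov-3.2"],
    documents=["GPU jobs above $1,000 require manual approval before launch..."],
    metadatas=[{"source": "cost-governance.pdf", "section": "3.2"}],
)

# Query time: embed the question, filter by metadata, then run similarity search
results = collection.query(
    query_texts=["How do I require approval for large GPU jobs?"],
    n_results=3,
    where={"source": "cost-governance.pdf"},
)
print(results["documents"][0])
```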
Vector embeddings
Text is converted into a list of hundreds of numbers, where similar meanings end up close together in that numerical space.
“Cloud bill spiked” and “AWS costs increased” end up near each other. “Deployment pipeline config” ends up far away.
That’s semantic search, searching by meaning, not exact keywords.
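You can see this directly with an off-the-shelf embedding model. The sketch below assumes the sentence-transformers library and its all-MiniLM-L6-v2 model; any embedding model behaves similarly.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Cloud bill spiked", "AWS costs increased", "Deployment pipeline config"]
vectors = model.encode(texts)

# Similar meaning scores higher, despite no shared keywords
print(util.cos_sim(vectors[0], vectors[1]))  # "Cloud bill spiked" vs "AWS costs increased": high
print(util.cos_sim(vectors[0], vectors[2]))  # vs "Deployment pipeline config": much lower
```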
Chunking
Every RAG system lives or dies by how it splits documents. Too small and you lose context. Too large and retrieval becomes imprecise.
- Fixed chunking splits every 500 tokens regardless of content. Mid-sentence cuts happen. Procedures split across chunks. Description in chunk 1, resolution steps in chunk 2. Retrieval finds half an answer.
- Semantic chunking splits where meaning changes.
- Each procedure can be one chunk. Complete, self-contained units.
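One rough way to approximate semantic chunking is to start a new chunk wherever consecutive sentences stop being similar. A minimal sketch, again assuming sentence-transformers; the 0.5 threshold is an assumption you would tune per corpus.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Start a new chunk where the meaning shifts
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```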
Re-ranking: Why similarity is not relevance
Vector search finds the most similar text. But similar isn’t always the most relevant answer. A chunk describing what triggers a billing spike may rank higher than the chunk explaining how to set a cost cap, even though the user asked how to fix it.
Re-ranking uses a cross-encoder model that sees the query and chunk together. It doesn’t ask “are these similar?” It asks “does this chunk answer this question?” That distinction is everything.
Two stages:
- Stage 1, vector search: fast and broad, retrieves the top 20 candidates
- Stage 2, re-ranker: slow and precise, scores each for true relevance
Bi-encoder (vector search):
- Query → [embed] → vector A
- Chunk → [embed] → vector B
- Score = cosine similarity(A, B)
- Less accurate → they never “see” each other
Cross-encoder (re-ranker):
- [Query + Chunk together] → model → relevance score 0.0 to 1.0
- Full attention between query and chunk
Only the top 3 to 5 chunks go to the LLM. You can’t rerank 10,000 chunks; vector search narrows the field fast, and the reranker picks the best from that shortlist.
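A minimal sketch of that second stage with a cross-encoder from sentence-transformers; the query, candidate chunks, and model name are illustrative.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I set a cost cap?"
candidates = [
    "A billing spike is usually triggered by uncapped GPU training jobs.",
    "To set a cost cap, open budget settings and define a monthly spending limit.",
]

# The cross-encoder sees query and chunk together and scores true relevance
scores = reranker.predict([(query, chunk) for chunk in candidates])
best_first = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(best_first[0][0])  # the "how to fix it" chunk should win
```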
What can go wrong
- If nothing in the vector DB answers the question, RAG still returns the closest chunks. The LLM then hallucinates from irrelevant context. Fix this with confidence thresholds: if the best chunk similarity is below 0.6, don’t retrieve at all. Add strict prompt instructions: “If the context does not contain the answer, say I don’t know.” Never guess.
- If the answer spans multiple chunks, retrieval might only find part of it. Fix this by retrieving more candidates, using parent document retrieval (chunk small for search, but return the full parent section on a match), and combining vector search with a knowledge graph.
- Documents go stale. When the manual is updated, only re-process the changed sections. Store a hash of each section, compare on new uploads, re-index only what changed. Fast, targeted, keeps answers current.
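The staleness fix in the last bullet is only a few lines. A minimal sketch, assuming documents are already split into named sections and the previous run’s hashes are persisted somewhere:

```python
import hashlib

def section_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_sections(new_sections: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    # Re-index only the sections whose content hash changed since the last upload
    return [
        section_id
        for section_id, text in new_sections.items()
        if stored_hashes.get(section_id) != section_hash(text)
    ]
```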
One way to fix the hardest RAG failures is a knowledge graph. Here’s why.
Knowledge Graph: When RAG Isn’t Enough
RAG finds text that’s similar to your question. But it doesn’t understand how things connect.
Ask “what costs are related to GPU usage?” and RAG returns chunks that mention both words. A knowledge graph traverses the actual relationship: GPU job → triggers → EC2 instances → drives up → compute costs → impacts → monthly budget KPI.
The three building blocks:
- Node: An entity (Training Job, EC2 Instance, Monthly Budget)
- Edge: A relationship between entities (TRIGGERS, DRIVES_UP, IMPACTS)
- Property: Attributes on nodes or edges
Every fact is stored as Subject → Relationship → Object.
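A tiny sketch of that structure using networkx as a stand-in graph store (production systems typically use a dedicated graph database such as Neo4j); the entities and relations come from the billing example.

```python
import networkx as nx

g = nx.DiGraph()

# Each fact is one Subject -> Relationship -> Object edge
g.add_edge("Training Job", "EC2 Instances", relation="TRIGGERS")
g.add_edge("EC2 Instances", "Compute Costs", relation="DRIVES_UP")
g.add_edge("Compute Costs", "Monthly Budget KPI", relation="IMPACTS")

# Traversal answers "what does a training job ultimately affect?"
print(list(nx.descendants(g, "Training Job")))
# ['EC2 Instances', 'Compute Costs', 'Monthly Budget KPI'] (order may vary)
```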
GraphRAG is Microsoft’s approach that combines both. Vector search finds similar content. Graph traversal understands how things connect. Together they dramatically reduce hallucinations on relationship-heavy questions.
Worth knowing: Knowledge graphs are powerful but expensive to build and maintain. For simple Q&A over documents, RAG alone is often enough. Add a knowledge graph when questions involve relationships, multi-hop reasoning, or a well-defined domain where the connections between things matter as much as the content.
Putting It All Together
This is where it gets interesting. The real power isn’t any one of these tools, it’s combining them.
Single chatbot, three sources:
User question
↓
Agent (LangGraph) decides what's needed
↓
MCP tools → live data from database
RAG search → relevant document chunks
Graph query → relationship context
↓
All three injected into single LLM context
↓
One unified answer with citations
Concrete example:
“Why is my bill so high and how do I stop it happening again?”
- MCP fetches → live cost breakdown, usage logs, which pipeline triggered it
- RAG retrieves → cost governance documentation, budget cap configuration guide
- LLM synthesizes → “Your bill spiked due to an uncapped training job. According to your cost governance policy section 3.2, GPU jobs above $1,000 require manual approval. Here’s how to configure that…”
Neither source alone gives that answer. That’s the point.
How to Prevent Hallucination
- Ground the LLM: the system prompt says to answer only from the provided context, not from general knowledge
- Structured data from MCP: Numbers don’t hallucinate, LLM just reads and reports them
- Citations from RAG: LLM tells the user exactly which document section it used
- Temperature near zero: For factual queries, less creativity, more precision
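The grounding and temperature points reduce to a prompt and one parameter. A minimal sketch with the OpenAI Python client as a stand-in; any chat-completion API works the same way, and the model name is just an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grounded_answer(question: str, context: str, source: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        temperature=0,         # factual queries: less creativity, more precision
        messages=[
            {"role": "system", "content": (
                "Answer only from the provided context, not from general knowledge. "
                "If the context does not contain the answer, say 'I don't know'. "
                f"Cite the source as ({source})."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```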
The Details That Matter
Getting Retrieval Right
- Embeddings: Don’t embed one document at a time. Batch and async. Cache embeddings so identical text never re-embeds. The query and index must use the exact same model. Different models produce incompatible vector spaces and the results are silently garbage.
- Vector DB: Chroma and FAISS are fine for prototypes. For production use Pinecone, Milvus, or pgvector. pgvector sits inside PostgreSQL, so if you’re already running Postgres you get vector search for free with no extra infrastructure and metadata filtering using standard SQL.
- Retrieval: Don’t just top-k. Use hybrid search combining vector and keyword. Always filter by metadata first to narrow the search space before similarity search. Rerank before injecting into the LLM.
Scaling It Up
- Scale: A single API endpoint breaks under load. Use async and workers. Autoscaling and load balancing for production traffic. Multi-layer caching: cache query results, cache embeddings, cache recent responses.
- Multi-tenancy: Never share vector spaces across tenants. Each factory, each customer, each team gets its own isolated chunk space. Metadata must include a tenant ID. Filter by tenant before any similarity search.
- Context: Don’t dump all retrieved chunks into the LLM. Compress and be token-aware. The LLM has a limited context window. Prioritize ruthlessly. Include citations so the LLM can tell the user where each piece of information came from.
Keeping It Honest
- Queries: Raw user queries are often bad search queries. Rewrite them before searching. “Bill went up again” is a terrible search query. “AWS EC2 cost spike causes and prevention” is much better. An LLM can rewrite queries automatically before retrieval (see the sketch after this list).
- Monitoring: Log every retrieval. What was fetched, what was ignored, what score it got. Track retrieval quality separately from answer quality. A great LLM can’t save bad retrieval. Trace full agent runs, which tools were called, in what order, what each returned.
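A minimal sketch of that rewrite step, again using the OpenAI client as a stand-in for whatever LLM you run:

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(raw_query: str) -> str:
    # Turn a vague user message into a retrieval-friendly search query
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's message as a concise, keyword-rich search query. "
                "Return only the rewritten query."
            )},
            {"role": "user", "content": raw_query},
        ],
    )
    return response.choices[0].message.content

print(rewrite_query("Bill went up again"))  # e.g. "AWS EC2 cost spike causes and prevention"
```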
MCP vs RAG: When to Use Which
Use RAG when:
- Answering from documents (manuals, SOPs, compliance guides, wikis)
- The knowledge is relatively static
- The question is “what does X mean” or “how does X work”
- The data is unstructured text
Use MCP when:
- Answering from live data (sensor readings, KPIs, database records)
- The data changes frequently
- The question is “what is happening now” or “what happened yesterday”
- The data is structured and numerical
Use both when:
- Question requires reasoning across live data & documentation together
- “Why is my bill so high and how do I stop it from happening again?” needs live cost breakdown and usage logs from MCP tools and the cost governance documentation and budget cap configuration guide from RAG. Neither source alone gives the complete answer.
Can you use RAG instead of MCP?
Technically, yes. But it’s the wrong tool. Data could be updating every 10 seconds; you’d need constant re-indexing. Vector search on numbers is inaccurate. “What was production at exactly 2:47pm?” is not a semantic question. It’s a lookup. Use MCP.
Can you skip MCP and just query the database directly?
Yes. But MCP adds reusability: any LLM can call it without changes. It gives the LLM agency to decide what to fetch based on the question, rather than your backend hardcoding what to pull. It adds clean separation: the chatbot doesn’t need to know SQL or table names. And it adds a standardized interface that any future LLM can plug into.
Where This All Goes
The models are no longer the bottleneck; open-source LLMs are genuinely capable and accessible. The real challenge now is connecting them to your world: your data, your documents, your systems. Agents, MCP, and RAG are the answer to that challenge. You don’t need to understand every nuance before you start: pick a real problem, build the smallest version of it, and the concepts will click faster than any amount of reading.