Context Engineering is the New Prompt Engineering

The skill that separates AI products that work from those that degrade over time has nothing to do with writing better prompts.

For the last two years, “prompt engineering” has been the skill everyone told you to learn. Find the right words. Structure the instruction well. Add a few examples. And the model will do what you want.

That advice was fine when we were building single-turn chatbots and one-shot classification tasks. It is no longer enough.

The moment you build anything that runs for more than one turn, calls tools, maintains state, or coordinates multiple agents, prompting becomes a small fraction of what determines your system’s behaviour. The bigger question becomes: what is the model actually looking at when it generates each response? What is in the context window, where is it positioned, and how much of it is signal versus noise?

That question is the domain of context engineering. And based on my experience building agentic systems, it is the single most underrated skill in applied AI today.

What Is Context Engineering, and Why Should You Care?

Anthropic introduced the term in September 2025 and defined it as the natural progression of prompt engineering. The distinction is clean:

Prompt engineering is about writing effective instructions for the model.

Context engineering is about curating and managing the entire set of tokens the model sees at inference time: the system prompt, tool descriptions, retrieved documents, conversation history, scratchpad notes, memory, and everything else that lands in the context window.

Why does the distinction matter?

Because in an agentic system, the prompt you wrote might be 500 tokens. But the total context at step 47 of a long-running task might be 150,000 tokens. Your carefully crafted prompt is now 0.3% of what the model sees. The other 99.7% (tool outputs, prior conversation turns, retrieved documents, state from earlier steps) is what actually drives the model’s behaviour.

If you are only engineering the prompt, you are optimising 0.3% of the input and hoping the other 99.7% takes care of itself. It will not.

The Problem: Context Rot

Here is the uncomfortable truth about large context windows. Bigger is not better. Not always.

Research consistently shows that LLM performance degrades as context length increases. This phenomenon, called “context rot,” has been documented across every frontier model.

The Chroma research study (2025) tested 18 frontier models and found accuracy drops of 20–50% when context grew from 10K to 100K tokens. All models were affected. Some decayed slower, but none were immune.

The underlying mechanism is the “lost in the middle” phenomenon, first documented by researchers at Stanford. LLMs show a U-shaped attention curve: they attend strongly to information at the beginning and end of the context, but performance degrades significantly for information positioned in the middle.

How bad is it? On multi-document question answering, accuracy dropped by over 30% when the answer document moved from position 1 to the middle of the context. In some cases, performance with the answer document in the middle was worse than having no documents at all.

And this is not just about retrieval. A 2025 study showed that context length alone degrades performance, even when irrelevant tokens are replaced with whitespace. The sheer volume of tokens interferes with reasoning, regardless of their content.

Every token you add to the context window costs you a small amount of the model’s attention. Context engineering is the discipline of making that budget count.

The Four Core Strategies

LangChain’s Lance Martin articulated a clean framework that maps well to how I think about the problem in practice. There are four fundamental operations in context engineering: write, select, compress, and isolate.

Strategy 1: Write (Persist Information Outside the Window)

When humans work on complex tasks, we take notes. We do not try to hold everything in our heads. Agents need the same capability.

Scratchpads are the simplest implementation. The agent writes notes to a file or a runtime state object during task execution, then reads them back when needed. It sounds trivial. It is transformative.

Anthropic’s multi-agent research system demonstrates this clearly: the LeadResearcher agent saves its plan to memory before the context window fills, because if the window exceeds 200,000 tokens, it will be truncated and the plan would be lost.

How have I seen this work in practice?

  • A NOTES.md file that the agent updates after each major step, tracking decisions, open questions, and progress
  • A structured state object where specific fields persist across turns while the conversation history gets compacted
  • A TODO list the agent maintains and checks at the start of each new reasoning step

The key insight: if information needs to survive beyond the current context window, it must be written somewhere outside of it. Relying on the context window alone is like relying on short-term memory for a project that spans days.

Products like Claude Code, Cursor, and Windsurf all implement this pattern through rules files and memory systems. It is not a research curiosity. It is production infrastructure.
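As a concrete sketch of the scratchpad pattern, here is a minimal file-backed implementation. The function names (`append_note`, `read_notes`) and the `NOTES.md` layout are illustrative assumptions, not any particular product’s API:

```python
from pathlib import Path

def append_note(notes_path: str, step: str, note: str) -> None:
    """Persist a decision, open question, or progress marker outside
    the context window, so it survives compaction or a fresh window."""
    entry = f"## {step}\n{note}\n\n"
    with Path(notes_path).open("a", encoding="utf-8") as f:
        f.write(entry)

def read_notes(notes_path: str) -> str:
    """Read the scratchpad back when the agent needs prior state."""
    path = Path(notes_path)
    return path.read_text(encoding="utf-8") if path.exists() else ""
```

The agent calls `append_note` after each major step and `read_notes` at the start of a new window, so critical state never depends on the context surviving intact.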

Strategy 2: Select (Pull in Only What Matters)

Not everything the agent has written or stored deserves to be in the context at every step. Selection is about intelligent filtering: surfacing only the information relevant to the current task.

There are three types of memories agents typically select from:

  • Episodic memories: examples of desired behaviour (few-shot examples from past interactions)
  • Procedural memories: instructions that steer behaviour (rules, guidelines, preferences)
  • Semantic memories: facts relevant to the current task (domain knowledge, user data)

The simplest implementations use fixed files that always get pulled in. Claude Code uses CLAUDE.md. Cursor uses rules files. These work for small, well-scoped memory sets.

But as the memory store grows, selection gets harder. You need semantic search, relevance scoring, or knowledge graphs to identify what matters for this specific step. The failure mode is over-inclusion: pulling in everything “just in case” and flooding the context with marginally relevant information that dilutes the signal.

The goal is not “give the model all the information.” It is “give the model the minimum information needed to succeed at this step.”
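To make the selection idea concrete, here is a deliberately crude sketch that scores memories by lexical overlap and keeps only the top few. A production system would use embeddings and semantic search; the names and thresholds here are illustrative assumptions:

```python
def score_relevance(query: str, memory: str) -> float:
    """Crude lexical-overlap score; real systems would use embeddings."""
    q = set(query.lower().split())
    m = set(memory.lower().split())
    return len(q & m) / len(q) if q else 0.0

def select_memories(query: str, memories: list[str], k: int = 3,
                    min_score: float = 0.1) -> list[str]:
    """Return only the top-k memories relevant to the current step,
    dropping anything below a minimum relevance threshold."""
    scored = [(score_relevance(query, m), m) for m in memories]
    scored = [(s, m) for s, m in scored if s >= min_score]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [m for _, m in scored[:k]]
```

The point is the shape, not the scoring function: filter first, cap the count, and let everything else stay out of the window.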

Strategy 3: Compress (Summarise Without Losing Signal)

Agent interactions can span hundreds of turns. Tool calls return enormous payloads. Conversation history grows with every exchange. Left unchecked, the context fills with stale, redundant, or low-value tokens.

Compression is about preserving meaning while reducing tokens. There are several practical patterns:

Compaction is the most common. When the context approaches its limit, the system summarises the conversation history and starts a fresh window with the summary. Claude Code implements this as “auto-compact” at 95% of the context window, preserving architectural decisions and unresolved bugs while discarding redundant tool outputs.
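The compaction trigger itself is simple to sketch. This is a hypothetical outline (the `token_count` and `summarize` callables stand in for real tokenisers and summarisation calls), not Claude Code’s actual implementation:

```python
def maybe_compact(messages, token_count, window_limit, summarize,
                  threshold: float = 0.95):
    """When usage crosses the threshold, summarise the history and start
    a fresh window seeded with the summary (an 'auto-compact' sketch)."""
    if token_count(messages) < threshold * window_limit:
        return messages  # plenty of room, leave the history as-is
    summary = summarize(messages)  # should preserve decisions, bugs, state
    return [{"role": "system",
             "content": f"Summary of prior work:\n{summary}"}]
```

The quality of the whole pattern lives in `summarize`: a weak summary silently loses the architectural decisions the next window needs.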

Tool result clearing is the lightest touch. Once a tool has been called deep in the conversation history, the raw result is rarely needed again. Clearing old tool outputs recovers significant token space with minimal information loss.
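A minimal version of tool result clearing might look like the following, assuming an OpenAI-style message list where tool outputs carry `"role": "tool"` (the stub text and `keep_last` parameter are my own illustrative choices):

```python
def clear_stale_tool_results(messages: list[dict],
                             keep_last: int = 2) -> list[dict]:
    """Replace tool outputs older than the last `keep_last` with a stub,
    recovering token space while keeping the record that a call happened."""
    tool_indices = [i for i, m in enumerate(messages)
                    if m.get("role") == "tool"]
    stale = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    return [{**m, "content": "[tool output cleared]"} if i in stale else m
            for i, m in enumerate(messages)]
```

Because the message structure is preserved, the model still sees that the calls occurred; only the bulky payloads are gone.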

Summarisation at agent boundaries applies when one agent hands off to another. Instead of passing the full conversation history, the outgoing agent produces a condensed summary. Cognition reportedly uses this pattern, reducing tokens during knowledge hand-offs between agents.

The critical question with any compression: what are you willing to lose? Freeform summarisation tends to silently drop details like file paths, specific numbers, or decision rationale. Structured summarisation, where the summary must populate specific sections (decisions made, files modified, open questions, current state), forces preservation of critical details.

Factory.ai’s research compared compression approaches and found that structured summarisation retained significantly more useful information than freeform alternatives. Structure forces preservation.
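One way to enforce structure is to make the summary a typed object whose required sections mirror the list above. The class below is a sketch of that idea, not Factory.ai’s or anyone else’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffSummary:
    """Structured summary: required sections force preservation of the
    details (paths, decisions, open questions) freeform summaries drop."""
    decisions_made: list[str] = field(default_factory=list)
    files_modified: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    current_state: str = ""

    def render(self) -> str:
        """Serialise to the text that seeds the next context window."""
        return "\n".join([
            "## Decisions made", *self.decisions_made,
            "## Files modified", *self.files_modified,
            "## Open questions", *self.open_questions,
            "## Current state", self.current_state,
        ])
```

Empty sections are visible in the output, which is itself useful: a hand-off with no recorded decisions is a signal that the summariser lost something.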

Strategy 4: Isolate (Compartmentalise to Prevent Contamination)

This strategy is about keeping context clean by giving different tasks their own separate context spaces.

Sub-agent architectures are the primary implementation. Instead of one agent trying to hold everything, specialised sub-agents handle focused tasks with clean context windows and return only condensed summaries.

Anthropic’s deep research system demonstrates this: each sub-agent explores extensively, using tens of thousands of tokens, but returns only 1,000–2,000 tokens of distilled findings. The lead agent never sees the messy exploration. It sees clean summaries.

Other isolation patterns include:

  • Sandboxed execution environments where tool results live as variables in the environment, not in the context window
  • Structured runtime state where only specific fields are exposed to the model while heavy data stays hidden
  • Scoped tool sets where each step only sees the tools relevant to its task, preventing confusion from overlapping descriptions

Isolation is context hygiene. The less cross-contamination between tasks, the more focused the model’s attention at each step.
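The sub-agent hand-off can be sketched as a function boundary: the full transcript stays inside, and only a budgeted summary crosses it. The `worker` and `summarizer` callables and the character budget are illustrative assumptions:

```python
def run_isolated(task: str, worker, summarizer,
                 budget_chars: int = 2000) -> str:
    """Execute a sub-agent in its own isolated context and return only a
    condensed summary; the messy exploration never reaches the lead agent."""
    transcript = worker(task)        # full exploration, isolated from the lead
    summary = summarizer(transcript)  # distil findings for the hand-off
    if len(summary) > budget_chars:
        raise ValueError("summary exceeds hand-off budget")
    return summary
```

Enforcing the budget at the boundary, rather than trusting each sub-agent to be terse, is what keeps the lead agent’s context from silently growing.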

The System Prompt Trap

One thing I want to address because I see it constantly: teams that treat the system prompt as a dumping ground.

Every edge case gets a new paragraph. Every policy gets a bullet point. Every past failure gets a new rule. The system prompt grows to 5,000 tokens, then 10,000, then 20,000. And performance gets worse, not better.

Why? Because longer system prompts consume a larger share of the attention budget. They push retrieved documents and conversation history further into the middle of the context, exactly where the model attends least. And contradictory instructions (which inevitably creep in as the prompt grows) create confusion.

Anthropic’s guidance is to find the “right altitude” for system prompts: specific enough to avoid ambiguity, but concise enough to avoid diluting attention. If a human engineer reading your system prompt finds it overwhelming, the model will too.

The test I use: can you explain what this agent does in three sentences? If not, the system prompt is probably trying to do too much and the system needs architectural decomposition, not a longer prompt.

Why This Changes How You Design Systems

Here is what shifted for me once I started thinking in context rather than in prompts.

Before: I would try to make one agent smarter by giving it more instructions, more tools, more context.

After: I design systems that keep each agent’s context small, focused, and fresh.

That shift changes everything:

  • Instead of one agent with 15 tools, I use three agents with 5 tools each, each with a clean context scoped to its task
  • Instead of passing full conversation history forward, I compress at boundaries and write critical state to external storage
  • Instead of retrieving 20 documents “for coverage,” I retrieve 3–5 highly relevant ones and position them where the model will attend
  • Instead of growing the system prompt every time something breaks, I ask “what should be in context at this step and what should not be?”

This is not about being clever with prompts. It is about treating context as a finite, precious resource and engineering the system to use it efficiently.

Practical Checklist

If you are building agentic systems, here are the questions I ask at every design review:

On writing context:

  • Does the agent persist critical state outside the context window?
  • Is there a scratchpad, a notes file, or a structured state object?
  • Will progress survive if the context is compacted or a new window starts?

On selecting context:

  • At each step, is the agent seeing only what it needs?
  • Are retrieved documents scored and filtered, or dumped in wholesale?
  • Are tool descriptions minimal and unambiguous?

On compressing context:

  • What happens when the context window approaches its limit?
  • Is compression structured (with required sections) or freeform?
  • Are stale tool outputs cleared?

On isolating context:

  • Does each sub-task operate in its own clean context?
  • Do sub-agents return summaries or full histories?
  • Are tool sets scoped per task?

On system prompt health:

  • Can you explain the agent’s purpose in three sentences?
  • Is the system prompt under 2,000 tokens?
  • Have you removed rules that should be enforced architecturally rather than instructionally?

The Takeaway

Prompt engineering taught us to write better instructions. Context engineering teaches us to architect what the model sees.

As models get smarter, the quality of the prompt matters less. The quality of the context matters more. A well-designed context on a decent model outperforms a brilliant prompt drowning in noise on the best model.

If you are building anything more complex than a single-turn chatbot, context engineering is no longer optional. It is the core discipline that determines whether your agent works at step 5 or falls apart at step 50.

The model is already smart enough. The question is whether you are giving it the right information, at the right time, in the right amount.

Context Engineering is the New Prompt Engineering was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
