The Agentic Scratchpad: Why Your LLM Needs a Cache Tool

Stop battering your APIs and bloating your context window. Architecting intermediate memory for complex reasoning.

The Context Window Tax (The Problem)

In the last year, the AI industry has been obsessed with a single metric: context length. We have watched models scale from an 8k token limit to 128k, 200k, and even over a million tokens. The initial reaction from many developers was a sigh of relief. The prevailing thought was, “Great, I no longer have to worry about chunking or memory. I’ll just dump my entire database schema, the user’s ten-year CRM history, and the 50-page corporate policy manual into the prompt for every single question.”

This is what I call the “Context Dump” architecture, and in an enterprise production environment, it is a catastrophic anti-pattern.

Relying on massive context windows as a substitute for working memory introduces three serious problems:

1. The Token Tax (Financial Hemorrhage)

Every time an agent takes a turn in a conversation, the entire context window is re-processed. If your agent is holding 100,000 tokens of API payload data in its prompt to answer a multi-step user query, you are paying for those 100,000 tokens on step one, step two, step three, and so on. In a live application, this approach will burn through thousands of dollars just to repeatedly read the exact same static dataset.
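
A quick back-of-envelope sketch makes the bleed concrete. The price and traffic figures below are assumptions for illustration, not real rates; plug in your own provider's numbers:

# Back-of-envelope: cost of re-reading a static 100k-token payload every turn.
# PRICE_PER_MTOK is an assumed rate; substitute your provider's pricing.
PRICE_PER_MTOK = 3.00          # USD per 1M input tokens (assumption)
PAYLOAD_TOKENS = 100_000       # static data pinned in the prompt
TURNS = 20                     # conversational turns in one session
SESSIONS_PER_DAY = 500         # hypothetical live traffic

cost_per_session = (PAYLOAD_TOKENS * TURNS / 1_000_000) * PRICE_PER_MTOK
print(f"Per session: ${cost_per_session:.2f}")                      # $6.00
print(f"Per day:     ${cost_per_session * SESSIONS_PER_DAY:,.2f}")  # $3,000.00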

2. Crippling Latency

LLMs do not read massive payloads instantly. Processing a 100k-token prompt takes time — often 10 to 30 seconds of Time-To-First-Token (TTFT). If your user is waiting 20 seconds for the agent to reply to “Did you find the error?”, your application is fundamentally broken from a user experience perspective.
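
As a rough mental model, TTFT scales with prompt length divided by the model's prefill throughput. The throughput figure below is an assumption, not a benchmark:

# Rough TTFT model: prompt_tokens / prefill throughput.
# 5,000 tokens/sec is an assumed prefill rate; real values vary
# widely by model and hardware.
PREFILL_TOKENS_PER_SEC = 5_000

for prompt_tokens in (500, 10_000, 100_000):
    ttft = prompt_tokens / PREFILL_TOKENS_PER_SEC
    print(f"{prompt_tokens:>7} tokens -> ~{ttft:.1f}s before the first output token")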

3. The “Lost in the Middle” Phenomenon

LLMs are essentially advanced pattern-matching engines. Research consistently shows that as the context window grows, the model’s ability to retrieve specific facts from the middle of that context sharply degrades. When you bloat the prompt with raw, unparsed API data, the agent loses track of its actual objective. It becomes buried in the noise.

We do not navigate our daily jobs by memorizing entire filing cabinets every morning. We keep a small, focused notebook on our desk. We need to architect the same capability for our agents. We must transition from raw context dumping to Working Memory Architecture.

The Working Memory Architecture (The Solution)

To solve the Context Window Tax, we must look to human cognition. When a human consultant is asked to analyze a massive 500-page financial report to find three specific data points, they do not attempt to memorize the entire report before speaking.

Instead, they use a desk. They fetch the report, place it on the desk, and use an index or search function to pull out only the three sentences they actually need to hold in their working memory.

We must build this exact “desk” for our agents. In systems architecture, this is called an Agentic Scratchpad or a Session Cache.

Instead of treating the context window as a massive, permanent storage drive, we treat it as highly limited Working Memory. To achieve this, we introduce two new, specialized Epistemic tools to the agent’s registry:

1. write_to_cache(key, payload)

When the agent executes a heavy Payload tool (like fetch_entire_crm_history), we change the underlying execution logic. Instead of the tool returning the massive 10MB JSON string directly back into the LLM's prompt, the tool silently saves that data to a local, session-scoped key-value store (e.g., Redis, or a simple Python dictionary in memory). The tool then returns a tiny, token-light confirmation string to the LLM: "Success: CRM history saved to cache under key 'user_123_history'."
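
Here is a minimal sketch of that rerouting. The plain dictionary stands in for the full SessionCacheManager built later in this article, and the CRM tool is simulated:

SESSION_CACHE = {}  # stand-in store; the full manager appears below

def fetch_entire_crm_history(user_id: str) -> dict:
    # Simulated heavy Payload tool; imagine a 10MB JSON response here.
    return {"user_id": user_id, "orders": [{"date": "2024-01-15", "total": 42.50}]}

def cached_tool(tool_fn, cache_key: str):
    """Wrap a heavy tool so its payload lands in the cache, not the prompt."""
    def wrapper(*args, **kwargs):
        SESSION_CACHE[cache_key] = tool_fn(*args, **kwargs)
        # Only this short string ever reaches the LLM's context window.
        return f"Success: CRM history saved to cache under key '{cache_key}'."
    return wrapper

tool_for_agent = cached_tool(fetch_entire_crm_history, "user_123_history")
print(tool_for_agent("user_123"))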

2. query_cache(key, query)

Now that the heavy data is safely resting on the “desk,” the agent uses its second tool. If the user asks, “When was this customer’s last purchase?”, the agent doesn’t need to read the whole file. It calls query_cache(key='user_123_history', query='last_purchase_date'). The middleware executes a fast, programmatic search (like JSONPath or a simple regex) against the cached data, and returns only the specific date to the context window.
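
Continuing the hypothetical CRM example, the middleware-side lookup might run a JSONPath expression against the cached payload. This sketch uses the jsonpath-ng library, and the payload structure is invented for illustration:

from jsonpath_ng import parse  # pip install jsonpath-ng

# Assume this payload was cached earlier under 'user_123_history'.
cached_payload = {"orders": [{"date": "2024-01-15", "total": 42.50},
                             {"date": "2024-03-02", "total": 19.99}]}

# The middleware runs the query; only the match re-enters the context window.
matches = [m.value for m in parse("orders[*].date").find(cached_payload)]
last_purchase_date = matches[-1] if matches else None
print(last_purchase_date)  # 2024-03-02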

The Architectural Shift

When you force the agent to route heavy payloads through a Session Cache rather than its own context window, the transformation is immediate and dramatic:

  • Token Costs Plummet: You are no longer paying to re-read 100k tokens on every conversational turn. The context window remains pristine, containing only the user’s prompt, the tool call logs, and the tiny extracted answers.
  • Latency Vanishes: Time-To-First-Token (TTFT) drops from 20 seconds down to sub-second speeds, because the model is only reading a few hundred tokens at a time.
  • Laser-Focused Accuracy: By eliminating the “Context Dump,” we cure the “Lost in the Middle” phenomenon. The agent is no longer hallucinating because it is no longer drowning in noise.

The context window is the most expensive and fragile component of an LLM. By building a cache, we protect it.

The Code Artifact: The Session Cache Manager

We have established the theory and the architecture. Now, we must build the infrastructure.

Implementing a working memory system requires a middleware component that sits between the LLM and your actual databases. This component manages the temporary state of the conversation and ensures that data doesn’t persist beyond its useful lifespan (preventing cross-session contamination).

Below is a Python implementation of a SessionCacheManager. It exposes the necessary tools to the agent while enforcing a strict Time-To-Live (TTL) constraint on the data.

import time
import json
from typing import Dict, Any, Optional


class SessionCacheManager:
    """
    A runtime intermediate memory store for an Agent.
    Routes heavy payloads to a local dictionary instead of the Context Window.
    """

    def __init__(self, ttl_seconds: int = 3600):
        # The physical "desk" where data is stored temporarily
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.ttl_seconds = ttl_seconds

    def write_to_cache(self, key: str, payload: Any) -> str:
        """
        Tool exposed to the Agent (or called implicitly by heavy payload tools).
        Saves data and returns a token-light confirmation to the LLM.
        """
        self._cache[key] = {
            "data": payload,
            "expires_at": time.time() + self.ttl_seconds,
        }
        # The LLM ONLY sees this tiny string, saving thousands of tokens.
        return (f"SUCCESS: Payload saved to Session Cache under key '{key}'. "
                f"Expires in {self.ttl_seconds}s.")

    def query_cache(self, key: str, query: Optional[str] = None) -> str:
        """
        Tool exposed to the Agent.
        Allows the agent to fetch specific, targeted data points from the heavy payload.
        """
        if key not in self._cache:
            return f"ERROR: Key '{key}' not found in Cache. You must write it first."

        entry = self._cache[key]

        # Enforce TTL (Memory wipe)
        if time.time() > entry["expires_at"]:
            del self._cache[key]
            return f"ERROR: Cache key '{key}' has expired. Please fetch the data again."

        data = entry["data"]

        # If the agent just wants the whole thing (rare, but supported)
        if not query:
            # We return a stringified version, but ideally, they use the query.
            blob = json.dumps(data)
            if len(blob) <= 2000:
                return blob
            return blob[:2000] + "... [TRUNCATED. USE query FOR SPECIFICS]"

        # Run the query to extract ONLY the necessary tokens
        extracted_data = self._execute_json_query(data, query)
        return json.dumps(extracted_data)

    def _execute_json_query(self, data: Any, query: str) -> Any:
        # Minimal dot-path lookup (e.g., "orders.0.date"). Swap in a full
        # querying library such as jsonpath-ng for production use.
        node = data
        for part in query.split("."):
            if isinstance(node, list) and part.lstrip("-").isdigit():
                idx = int(part)
                if not -len(node) <= idx < len(node):
                    return f"ERROR: Index '{part}' out of range."
                node = node[idx]
            elif isinstance(node, dict) and part in node:
                node = node[part]
            else:
                return f"ERROR: Path segment '{part}' not found in payload."
        return node
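
A quick round trip shows the intended flow; keeping one manager per conversation prevents cross-session contamination. The session id and payload are invented for illustration:

# One manager per conversation keeps sessions isolated from each other.
session_caches = {"session_abc": SessionCacheManager(ttl_seconds=1800)}
cache = session_caches["session_abc"]

heavy_payload = {"customer": "user_123", "orders": [{"date": "2024-03-02", "total": 19.99}]}
print(cache.write_to_cache("user_123_history", heavy_payload))
print(cache.query_cache("user_123_history", "orders.0.date"))  # "2024-03-02"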

The Outcome

By equipping your agent with this SessionCacheManager, you fundamentally change its behavior. When asked to analyze a complex enterprise dataset, the agent will call write_to_cache to park the raw JSON on its "desk", read back the tiny confirmation string, and then iteratively call query_cache to surgically extract only the variables it needs to formulate an answer.

You have successfully decoupled data fetching from data reasoning. You have engineered a system that is cheaper, faster, and immune to context degradation.

Build the Complete System

This article is part of the Cognitive Agent Architecture series. We are walking through the engineering required to move from a basic chatbot to a secure, deterministic Enterprise Consultant.

To see the full roadmap — including Semantic Graphs (The Brain), Gap Analysis (The Conscience), and Sub-Agent Ecosystems (The Organization) — check out the Master Index below:

The Cognitive Agent Architecture: From Chatbot to Enterprise Consultant

