I've been digging into how AI coding agents actually handle memory — not what the marketing says, but what the code and benchmarks show. Here's what I found.
TL;DR: Every agent memory system in 2026 is either too simple (can't search), too expensive (600K tokens per conversation), or too clever (burns tokens on memory management instead of actual work). The real unsolved problem isn't remembering — it's forgetting.
How the major systems actually work
Claude Code — Reads CLAUDE.md at session start. Entire file goes into context. No vector DB, no semantic search. Auto memory (v2.1.59+) writes notes to markdown files. Hard cap: 200 lines for MEMORY.md, everything beyond silently truncated.
Intentionally simple. Works for small projects. Falls apart on monorepos with years of decisions.
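The described behavior fits in a few lines. This is a hypothetical sketch that mirrors the post's description (whole-file load, 200-line silent truncation), not Claude Code's actual implementation:

```python
from pathlib import Path

MEMORY_LINE_CAP = 200  # hard cap described above; lines past it are silently dropped

def load_session_context(project_root: str) -> str:
    """Naive file-based memory: read everything at session start, no search."""
    root = Path(project_root)
    parts = []
    claude_md = root / "CLAUDE.md"
    if claude_md.exists():
        parts.append(claude_md.read_text())  # entire file enters the context window
    memory_md = root / "MEMORY.md"
    if memory_md.exists():
        lines = memory_md.read_text().splitlines()
        parts.append("\n".join(lines[:MEMORY_LINE_CAP]))  # silent truncation
    return "\n\n".join(parts)
```

The failure mode is visible in the code: context cost grows linearly with file size, and there's no retrieval step to select what's relevant, which is exactly why it breaks down on large, old codebases.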
Mem0 (48K stars) — Decomposes interactions into facts, stores as embeddings, retrieves via semantic search. Sounds great until you check the numbers:
| System | LongMemEval | Tokens per conversation |
|---|---|---|
| Mem0 | 49.0% | ~1,764 |
| Zep | 63.8% | ~600,000 |
| Letta | ~83.2% | Dynamic |
Mem0 recalls the right information less than half the time. Zep is better — but burns ~340x more tokens per conversation for roughly 15 points of accuracy. The Zep team disputes the Mem0 paper's methodology, claiming 75.1% with proper configuration. Even so, the cost asymmetry stands.
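The extract-embed-retrieve pattern itself is simple. Here's a toy sketch of the idea — not Mem0's actual API; the bag-of-words "embedding" stands in for a real model, and the sentence-splitting "fact extraction" stands in for an LLM decomposition step:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (a real system uses a model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    """Decompose interactions into facts, store vectors, retrieve by similarity."""
    def __init__(self):
        self.facts: list[tuple[str, Counter]] = []

    def add(self, interaction: str):
        # Real systems use an LLM to extract atomic facts; here each
        # sentence is naively treated as one "fact".
        for fact in filter(None, (s.strip() for s in interaction.split("."))):
            self.facts.append((fact, embed(fact)))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]
```

The benchmark numbers above suggest the hard part isn't this pipeline — it's that lossy fact extraction plus approximate retrieval compounds into sub-50% recall on long conversations.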
Letta/MemGPT — Treats context window like RAM, external storage like disk. Agent decides what to page in and out. Best benchmark score (~83.2%). But every memory operation costs inference tokens. The agent spends significant budget reasoning about what to remember instead of doing the work.
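The RAM/disk analogy maps to code directly. A minimal sketch of the paging idea — names and the flat per-operation token cost are my assumptions, not Letta's implementation:

```python
class PagedMemory:
    """Context window as RAM, external store as disk (MemGPT-style sketch).
    Every page operation costs inference tokens from the agent's budget."""
    def __init__(self, context_slots: int = 4, op_cost: int = 50):
        self.context: dict[str, str] = {}   # "RAM": what the model currently sees
        self.archive: dict[str, str] = {}   # "disk": external storage
        self.context_slots = context_slots
        self.op_cost = op_cost              # hypothetical flat cost per operation
        self.tokens_spent = 0               # budget burned on memory management

    def page_in(self, key: str):
        self.tokens_spent += self.op_cost   # reasoning about memory isn't free
        if len(self.context) >= self.context_slots:
            # Evict the oldest entry; a real agent reasons about which to evict,
            # which costs even more tokens.
            evicted = next(iter(self.context))
            self.archive[evicted] = self.context.pop(evicted)
        if key in self.archive:
            self.context[key] = self.archive.pop(key)

    def page_out(self, key: str):
        self.tokens_spent += self.op_cost
        if key in self.context:
            self.archive[key] = self.context.pop(key)
```

The `tokens_spent` counter is the point: the best-scoring architecture pays for its accuracy with a standing tax on every memory decision.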
The actual problem: no agent knows how to forget
Ebbinghaus mapped the human forgetting curve in 1885. We don't keep everything. We forget most things. What survives got reinforced through repetition or significance.
AI agents have two modes: hoard everything (vector stores growing forever) or lose everything (session boundary wipes the slate). There's no middle ground.
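The middle ground has a well-known mathematical shape. Ebbinghaus-style retention is exponential decay, R = e^(-t/S), where stability S grows with reinforcement — the constants below are illustrative, not fitted to any data:

```python
import math

def retention(hours_since_use: float, stability: float) -> float:
    """Forgetting curve: R = e^(-t/S).
    Higher stability S (earned through repeated recall) means slower decay."""
    return math.exp(-hours_since_use / stability)

def reinforce(stability: float, boost: float = 2.0) -> float:
    """Each successful recall multiplies stability -- the spaced-repetition intuition."""
    return stability * boost
```

An agent built on this wouldn't hoard or wipe: everything decays by default, and only what gets used earns the stability to survive.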
Claude Code's leaked source (March 31 npm packaging error) hints at the right direction. There's a DreamTask module that runs during idle time — consolidating memories, merging duplicates. The codebase literally calls it "dreaming." But it's primitive. A memoryAge.ts module appends text warnings like "This memory is 47 days old" — but the system doesn't actually reduce the memory's weight or trigger re-verification. It's a label, not a mechanism.
What we need: active curation. A system that continuously evaluates what's worth keeping, what should decay, and what should be promoted from short-term to long-term. Not "store and search" — "curate and forget."
This gets way harder with multiple agents
Claude Code's subagents share a CLAUDE.md file. Agent A writes, Agent B picks it up on next read. Works for 2-3 agents. At 20+ agents making concurrent decisions? Write conflicts, stale reads, contradictory entries nobody reconciles.
Research in agent-based social simulation (Stanford's Generative Agents, Tsinghua's AgentSociety) has been hitting these problems for years at 100+ agent scale. Questions that no production system answers:
- If 50 agents independently store the same fact, is it more reliable or just more popular?
- When two agents have contradictory memories, how do you resolve without picking an arbitrary winner?
- When does a group "forget" something — when every individual forgets, or when it stops being referenced?
These aren't academic curiosities. They're the exact problems any multi-agent coding setup will face at scale.
My take
After studying all of these, I think the field is stuck on the wrong framing. Memory isn't a storage problem. It's a coordination and curation problem. The pieces that seem necessary:
- Tiered personal memory with explicit promotion/demotion rules
- Shared state as a protocol (not a shared file)
- Active forgetting — relevance decay weighted by usage and cross-agent reinforcement
- Conflict as first-class data — maintain disagreements instead of silently picking winners
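The last piece is the easiest to sketch: keep every claimed value with its backing agents, surface breadth of support, and never silently drop the minority view. A hypothetical illustration of the data structure:

```python
from collections import defaultdict

class ConflictAwareMemory:
    """Contradictions are stored, not resolved: each key maps to every
    claimed value plus the set of agents backing it."""
    def __init__(self):
        self._claims: dict[str, dict[str, set[str]]] = \
            defaultdict(lambda: defaultdict(set))

    def assert_fact(self, key: str, value: str, agent: str):
        self._claims[key][value].add(agent)

    def lookup(self, key: str) -> list[tuple[str, int]]:
        """All claimed values with support counts, most-backed first --
        including the disagreement."""
        claims = self._claims[key]
        return sorted(((v, len(agents)) for v, agents in claims.items()),
                      key=lambda pair: -pair[1])

    def is_contested(self, key: str) -> bool:
        return len(self._claims[key]) > 1
```

Note this deliberately leaves the 50-agents question open: `lookup` reports that a value is more *popular*, and whether popularity implies reliability is exactly the judgment a resolver would still have to make.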
The Meta-Harness paper (Stanford/MIT, March 2026) showed that harness design alone produces a 6x performance gap on the same model. Memory is probably the highest-leverage harness component still wide open.
The agent that wins won't remember the most. It'll forget the best.
What's your actual memory setup? Anyone found something that works across sessions without massive overhead?