Most AI Agent Memory Systems Are Broken, Here’s Why

A concise tour of Hermes Agent memory — MEMORY.md, USER.md, prefetch/sync, and when session search is not memory.

You’ve probably experienced it: an AI agent that feels brilliant for 20 minutes, then completely forgets everything the moment you close the tab. You come back the next day and it’s like talking to a stranger.

The industry’s response has been to build bigger and bigger memory systems — vector databases, knowledge graphs, retrieval pipelines. The problem isn’t that memory systems are too small. The problem is that they treat memory like a database instead of a brain.

Why This Matters

If your agent can’t remember your preferences, your project conventions, or mistakes it made last week, it’s not an agent — it’s a slightly smarter autocomplete. Every session starts from zero. Every correction has to be repeated. Every workflow has to be re-explained. That’s not a usability issue; it’s an architectural failure. And it compounds quickly: the more you rely on an agent, the more painful the amnesia becomes.

TL;DR

  • Hermes keeps persistent agent state in two capped files (MEMORY.md / USER.md) injected into the system prompt — curation by design, not an endless retrieval pile.
  • Memory runs in two layers: those files are always on; you can add exactly one external provider at a time for heavier retrieval and retention (Honcho, Hindsight, Mem0, etc.).
  • Each user turn is recall then store for external backends — prefetch context before the reply, sync dialogue afterward; the core files stay in the frozen prompt and do not go through prefetch.
  • Optional plugins expose recall_mode (automatic prefetch only, tools only, or both); session transcript search is a separate tool from long-term memory.
  • Bounded space plus consolidation beats infinite logs — forgetting noise is part of the design.

The AI Agent Memory Problem

Large Language Models are stateless by design. Each request is independent. Each response is generated from whatever prompt you send right now. There is no memory, no history, no continuity beyond the tokens in the current context window.

For a single question-and-answer exchange, this is fine. But agents are supposed to do things across sessions. They should learn from mistakes, adapt to preferences, and build working knowledge over time. Statelessness makes that impossible without intervention.

The obvious intervention is to add context. Attach the previous conversation. Include project documentation. Send the entire history. Context windows are growing — 128K tokens and beyond — so theoretically you can fit everything in there.

But context is not memory. Context is a dump. Memory is a distillation.

Context has no curation. As it grows, the model processes thousands of tokens of irrelevant history to find the one fact it needs. That costs tokens, compounds latency, and eventually degrades performance. Memory, by contrast, is the compressed essence of experience — small, structured, and always available.

Human memory works the same way. You don’t remember every conversation you’ve ever had. You remember the parts that matter: who you’re talking to, what they care about, what you’ve agreed on. The rest is either forgotten or searchable when you need it.

What Most Frameworks Get Wrong

The AI agent memory space has exploded since 2024. Letta reached 21,000 GitHub stars with its three-tier memory model. Zep and Graphiti built temporal entity tracking. Mem0 grew to 48,000 stars with server-side memory extraction. Databricks published research on “memory scaling.” Cognee built knowledge extraction pipelines with 30+ connectors.

They all share a fundamental flaw: they treat memory as a retrieval problem. Store it somewhere. Query it when needed. Inject the results into context.

This approach introduces three problems:

  1. Latency — every memory access can mean tool calls, queries, and summarization. What should feel instant becomes a pipeline.
  2. Noise — retrieved chunks compete with the current task for attention.
  3. Complexity — vectors, embedders, graphs, and indexing are a lot to operate for a personal agent.

The pattern across all these frameworks is the same: memory is something the agent retrieves when it needs it. That's the wrong mental model for identity-level preferences and conventions.

How Hermes Structures Memory

Hermes Agent puts persistent state in the system prompt — curated, bounded, always active — instead of defaulting to “store everything, retrieve later.” Read this section in order: layers and per-turn flow first, then what lives in each file.

Two layers

  1. Built-in — MEMORY.md (2,200 characters) and USER.md (1,375 characters), file-backed, always loaded. Together they stay under ~3,600 characters (~1,300 tokens), a deliberate ceiling.
  2. Optional external provider — one plugin at a time (Honcho, OpenViking, Mem0, Hindsight, Holographic, RetainDB, ByteRover, Supermemory, …). It sits beside the core files; it does not replace them.

Recall before the reply, store after

For external backends, each turn follows the same skeleton the codebase exposes — MemoryManager.prefetch_all(query) before the model answers (each backend runs provider.prefetch(query) against its store). After the assistant message, MemoryManager.sync_all(user, assistant) runs provider.sync_turn, then provider.queue_prefetch can prepare retrieval toward the next turn.

MEMORY.md and USER.md are not fetched through prefetch_all; they are already part of the frozen system prompt.

user message
-> prefetch_all(query) -> provider.prefetch(query)
-> context for this turn -> model -> assistant message
-> sync_all(user, assistant) -> sync_turn + queue_prefetch
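
In code, the loop might look like this. The method names (prefetch_all, prefetch, sync_all, sync_turn, queue_prefetch) are the ones the article cites from the codebase; the bodies, plus the manager and model.reply objects in the usage snippet, are a placeholder sketch, not Hermes source:

class MemoryManager:
    def __init__(self, providers):
        # In Hermes, at most one external provider is active at a time.
        self.providers = providers

    def prefetch_all(self, query):
        # Recall: ask each backend for context relevant to this turn.
        chunks = []
        for provider in self.providers:
            chunks.extend(provider.prefetch(query))
        return chunks

    def sync_all(self, user, assistant):
        # Store: hand the finished turn to each backend, then let it
        # warm up retrieval for the next turn.
        for provider in self.providers:
            provider.sync_turn(user, assistant)
            provider.queue_prefetch(user)

# One turn end to end (MEMORY.md / USER.md are already in the frozen prompt):
context = manager.prefetch_all(user_message)
assistant_message = model.reply(system_prompt, context, user_message)
manager.sync_all(user_message, assistant_message)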

How writes and reads actually land

Long-term updates arrive three ways:

  • The built-in memory tool (add / replace / remove, targeting memory or user).
  • Passive retention during each provider's sync path; behaviour varies: Hindsight batches retention, Honcho runs dialectic ingestion, Mem0-style stacks extract facts from turns.
  • Explicit provider tools when exposed (honcho_conclude, hindsight_retain, honcho_profile, and peers).

Reads split between automatic prefetch injection and explicit tools (honcho_search, honcho_reasoning, hindsight_recall, hindsight_reflect, …). Plugins expose recall_mode next to memory.provider — context-only (inject only), tools-only, or hybrid — trading tokens against control.
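
Conceptually, recall_mode is a two-bit gate. A minimal sketch, assuming the three mode names above; the function itself is illustrative, not plugin source:

def build_turn_context(manager, query, recall_mode):
    # context-only / hybrid: inject prefetched memory into the prompt.
    inject = recall_mode in ("context-only", "hybrid")
    injected = manager.prefetch_all(query) if inject else []
    # tools-only / hybrid: keep explicit recall tools registered for the model.
    tools_enabled = recall_mode in ("tools-only", "hybrid")
    return injected, tools_enabled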

Session search versus long-term memory

session_search searches past conversation transcripts when the question sounds like "we discussed this before." Core files and provider stores hold durable facts that should survive sessions — a different contract than rifling through chat logs.
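
The contrast in one place (session_search is the tool name from Hermes; the query and the memory line are illustrative):

# "We discussed this before" -> rifle through transcripts:
session_search("gateway deploy steps from last week")

# Durable fact -> no lookup at all; it is already in the frozen prompt:
# MEMORY.md: "User's project is a Go microservice at ~/code/gateway ..."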

What the two files contain

MEMORY.md holds agent-side notes — environment, project conventions, tool quirks. USER.md holds the profile — identity, preferences, habits.

At session start both load as a frozen block in the system prompt. Headers show how full each buffer is; entries stay dense and declarative.

Dense signal:

User's project is a Go microservice at ~/code/gateway using gRPC + PostgreSQL
This machine runs Ubuntu 22.04, has Docker and kubectl installed
User prefers snake_case for variable names and avoids camelCase

Verbose noise:

On January 5th, 2026, the user asked me to look at their project which
is located at ~/code/gateway and it uses Go with gRPC and PostgreSQL
for the database layer. The user mentioned they prefer snake_case for
variable names and explicitly said they avoid camelCase formatting.

The first packs three durable facts into roughly 200 characters; the second spends nearly 280 characters restating two of them. The character caps exist to force that compression.

Why Character Limits Force Better Memory

The 2,200 and 1,375 limits are not accidental ceilings. They enforce curation — merge, compress, drop fluff — instead of infinite append-only logs.

When memory is bounded, full buffers trigger an explicit consolidate-and-merge workflow rather than silent failure.
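
A sketch of that write path, assuming the caps from above and a caller-supplied consolidation pass (for instance an LLM merge step); the add_entry and consolidate names are hypothetical, not Hermes source:

MEMORY_CAP = 2200  # characters, MEMORY.md
USER_CAP = 1375    # characters, USER.md

def add_entry(store: str, entry: str, cap: int, consolidate) -> str:
    candidate = f"{store}\n{entry}".strip()
    if len(candidate) <= cap:
        return candidate
    # Full buffer: no silent truncation. Compress first, then retry the write.
    candidate = f"{consolidate(store)}\n{entry}".strip()
    if len(candidate) > cap:
        raise ValueError("still over cap after consolidation; merge or drop entries")
    return candidate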

Because the memory block is frozen for the session, the model can exploit prefix caching: the static prefix is encoded once, and each new turn mostly rolls forward from there. That stays fast without re-encoding the same memory tokens every time.

For broader context on self-hosted stacks (routing, orchestration, where memory sits), see the local AI systems guide.

How the Agent Decides What to Remember

The agent uses one tool with three actions: add, replace, remove. There is no separate read action — injected content is already in the prompt.
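
A sketch of that surface. The action names and the memory/user targets come from the article; the file handling below is illustrative, not the Hermes implementation:

from pathlib import Path

FILES = {"memory": Path("MEMORY.md"), "user": Path("USER.md")}

def memory_tool(action: str, target: str, text: str = "", old: str = "") -> None:
    path = FILES[target]
    content = path.read_text() if path.exists() else ""
    if action == "add":
        content = f"{content}\n{text}".strip()
    elif action == "replace":
        content = content.replace(old, text)
    elif action == "remove":
        content = content.replace(old, "").strip()
    path.write_text(content)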

Rough priority:

  • User corrections and explicit instructions — save immediately.
  • Inferred preferences — tend toward USER.md.
  • Environment facts and project conventions — tend toward MEMORY.md.
  • Skip trivial detail, raw dumps, session-only fluff, and duplicates of what’s already in workspace prompt files.

After heavy multi-step work it may save durable lessons or quirks worth repeating later.

The Distillation Pattern

An agent reads a long paper on memory scaling. It does not paste the paper into memory. It keeps one line:

Memory scaling: agent performance improves with accumulated experience through user interaction and business context stored in memory.

External docs and repos are the library; internal memory is the working distillate you carry every session.

When to Use Bounded Memory

  • Preferences, identity, environment facts, conventions, durable lessons.
  • Anything that should change how the agent behaves before it reads the next user message.

When to Avoid Bounded Memory

  • Whole-document retrieval — use RAG or a knowledge base.
  • “What did we say last month?” — use session search or history tools.
  • Large structured datasets — use a database.

The Philosophy: Why Forgetting Is a Feature

The instinct is to store everything. Hermes argues the opposite for agent identity memory — limited space forces signal.

  • Curation beats bulk. A thousand-token frozen core is faster and clearer than repeated retrieval over megabytes of chatter.
  • Noise compounds — more remembered text is not smarter text.
  • Forgetting is maintenance — remove, replace, compress when facts drift.
  • Databricks-style “memory scaling” research lines up — quality of what you retain beats raw volume.

What This Means

Memory is becoming the differentiator for agents, not just model weights. Two setups with the same base model diverge fast when one carries curated continuity and the other starts cold.

The answer is not an infinitely wide database by default. It is a small, sharp slice the agent always carries — plus optional backends when you consciously opt into their cost and complexity.

This is how an AI agent remembers you. Not by storing everything, but by remembering what matters.

👉 Hermes Agent Memory System: How Persistent AI Memory Actually Works

