The Conversation Storage Problem Nobody Is Solving Yet

Every company that has deployed an LLM in the last two years is now sitting on an unusual asset: a vast and rapidly growing collection of conversations between humans and machines. Customer support chats. Internal research queries. Code generation sessions. Debugging exchanges. Drafts that nobody finished. Tests that nobody cleaned up.

This data isn’t going anywhere. Most of it isn’t deleted, isn’t compressed, and isn’t curated — it just accumulates. A mid-sized company with a few thousand active users might generate millions of conversations a year. At enterprise scale, the numbers become staggering. And nobody has a good answer for what to do with it.

I don’t mean this as a doomsday claim. I mean it literally: when I started looking at this, I couldn’t find anyone making intelligent decisions about which conversations were worth keeping. Companies are storing everything by default and worrying about it later. “Later” is going to be expensive.

Why this is harder than it sounds

The instinctive reaction is “just delete old stuff.” That’s how most data gets handled. Cloud providers offer lifecycle policies — S3 Intelligent-Tiering, Azure Blob lifecycle rules, Google Cloud’s object lifecycle management — that automatically move data to cheaper storage tiers or delete it after a fixed period. These are mature, well-engineered tools. They’ve worked fine for decades for things like log files and backups.

They don’t work for conversation data.

The reason is that the value of a conversation isn’t structural. It’s semantic. A 90-day-old conversation might be:

  • A throwaway debugging session that should have been deleted the day it ended
  • A genuinely insightful customer interaction that documented a real product issue
  • A research thread that someone is still actively building on
  • A failed experiment that contains the seed of the next experiment

You cannot tell which is which from metadata. Size, age, and access count get you partway, but they all converge on the same blind spot: a 90-day-old conversation is just a 90-day-old conversation, regardless of whether it’s gold or garbage. Treating them all the same — which is exactly what fixed-period lifecycle rules do — means you either keep too much (expensive) or delete too much (worse).

To make these decisions intelligently, you need to actually understand what each conversation is about. And until very recently, doing that at any meaningful scale wasn’t economically feasible.

The idea

So I built an agent that makes those decisions. It's called the Data Lifecycle Agent, and the question it asks of every conversation is:

Is the cost of keeping this conversation higher than the cost of losing it — and is it worth paying for me to figure that out?

Three things get weighed:

Storage cost. What this conversation actually costs to keep, per day, projected forward. This is straightforward — bytes times rate times time.
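
In code, that projection is a one-liner. A minimal sketch; the per-gigabyte rate below is a placeholder, not the number the agent actually uses:

```python
GB = 1024 ** 3

def projected_storage_cost(size_bytes: int, rate_per_gb_month: float = 0.023,
                           months: int = 12) -> float:
    """Bytes times rate times time, projected `months` ahead."""
    return (size_bytes / GB) * rate_per_gb_month * months

# e.g. a 2 MB conversation kept for a year works out to a fraction of a cent
```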

Semantic value. How unique the content is, and how useful it’s likely to be in the future. This is the hard part. The agent uses Claude (Anthropic’s API) to score each conversation on uniqueness and utility, based on a structured summary of its metadata.
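
Sketched with the Anthropic Python SDK, the scoring call looks roughly like this; the prompt, model name, and score fields are illustrative rather than the project's exact ones:

```python
import json
import anthropic

client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

def score_conversation(summary: str) -> dict:
    """Ask Claude to rate a conversation's uniqueness and likely future utility."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; any current model works
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Given this conversation metadata summary, rate its uniqueness and "
                "likely future utility from 0 to 10. Reply with JSON only, using the "
                'keys "uniqueness", "utility", and "reasoning".\n\n' + summary
            ),
        }],
    )
    # Assumes the model returns bare JSON; a production version would validate this.
    return json.loads(response.content[0].text)
```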

The agent’s own cost. This is the part most cost-optimization tools ignore, and the part I find most interesting. Every API call costs money. If the agent runs blindly on every conversation, it will inevitably encounter conversations where its own analysis costs more than the storage it could save. On those, the agent isn’t optimizing — it’s actively destroying value.

So the agent has to be honest about itself. It has to estimate, before it runs, whether running is worth it.

How the trade actually gets made

Most agents are designed to run. This one is designed to first decide whether to run, and only then proceed.

The pipeline has five stages, and the goal of each early stage is to filter aggressively so the expensive stages handle as little work as possible.

It starts with metadata summaries. The agent takes each conversation’s existing metadata — size, age, access patterns, last accessed time — and generates a short text description. This costs nothing; it’s just structured strings.
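
A minimal version of that step, with field names assumed for illustration:

```python
def summarize_metadata(conv: dict) -> str:
    """Turn a conversation's raw metadata into a short text description for scoring."""
    return (
        f"Conversation {conv['id']}: {conv['size_bytes'] / 1024:.0f} KB, "
        f"{conv['age_days']} days old, accessed {conv['access_count']} times, "
        f"most recently {conv['days_since_access']} days ago."
    )
```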

Then a heuristic pre-screen. Simple rules handle the easy cases. A conversation accessed yesterday gets classified as KEEP without any API call. So does one that's only 5 KB and a week old, and so does anything that's safety-flagged. The principle: if the answer is obvious to a rule, don't pay an LLM to confirm it.
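
In code, those rules are just guard clauses; the thresholds and field names below are illustrative:

```python
def prescreen(conv: dict) -> str | None:
    """Return 'KEEP' for the obvious cases, or None to pass the conversation along."""
    if conv.get("safety_flagged"):
        return "KEEP"                    # never touch flagged conversations
    if conv["days_since_access"] <= 1:
        return "KEEP"                    # accessed yesterday: clearly still live
    if conv["size_bytes"] <= 5 * 1024 and conv["age_days"] <= 7:
        return "KEEP"                    # tiny and recent: not worth analyzing
    return None                          # undecided, send on to clustering
```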

Surviving conversations get clustered. The agent groups them by size, age, and a recency-weighted engagement score. Fifty conversations that all look like “large, old, never accessed” don’t need fifty separate judgments. They get grouped into one cluster, and a single representative is sent for scoring on behalf of the whole group.
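
A sketch of the bucket-based grouping; the bucket edges are arbitrary stand-ins for the real thresholds:

```python
from collections import defaultdict

def bucket_key(conv: dict) -> tuple:
    """Discrete buckets over size, age, and a recency-weighted engagement score."""
    size_bucket = "large" if conv["size_bytes"] > 1_000_000 else "small"
    age_bucket = "old" if conv["age_days"] > 90 else "recent"
    engagement = conv["access_count"] / (1 + conv["days_since_access"])
    engagement_bucket = "active" if engagement > 0.1 else "dormant"
    return (size_bucket, age_bucket, engagement_bucket)

def cluster(conversations: list[dict]) -> dict[tuple, list[dict]]:
    """Group conversations that look alike; one representative per group gets scored."""
    clusters = defaultdict(list)
    for conv in conversations:
        clusters[bucket_key(conv)].append(conv)
    return clusters
```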

Then comes the gate. Before any API call happens, the agent calculates the total potential storage savings across all clusters, then estimates the total cost of running its analyses. If the cost exceeds the savings, the agent stands down — the entire run is canceled, no API calls are made, and the result is logged. This is the moment the agent demonstrates economic self-awareness: it knows when not to bother.
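
The gate itself reduces to one comparison once the estimates exist. A sketch, reusing the cost projection from earlier:

```python
def should_run(clusters: dict, est_cost_per_call: float) -> bool:
    """Compare total potential storage savings against the cost of analyzing them."""
    potential_savings = sum(
        projected_storage_cost(conv["size_bytes"])    # from the earlier sketch
        for members in clusters.values()
        for conv in members
    )
    analysis_cost = len(clusters) * est_cost_per_call  # one API call per cluster
    return analysis_cost < potential_savings           # False: stand down, log, make no calls
```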

For the clusters that survive the gate, only then does the agent call Claude for semantic scoring. The verdict on each cluster representative — KEEP, COMPRESS, or DELETE — gets applied to all members of that cluster, with the API cost split proportionally based on each member’s share of the savings.
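
Roughly, for each cluster (field names again are placeholders):

```python
def apply_verdict(members: list[dict], verdict: dict, api_cost: float) -> list[dict]:
    """Copy the representative's verdict to every member, attributing cost pro rata."""
    savings = [projected_storage_cost(m["size_bytes"]) for m in members]
    total = sum(savings) or 1.0   # guard against an all-zero cluster
    results = []
    for member, saving in zip(members, savings):
        share = saving / total
        results.append({
            "conversation_id": member["id"],
            "action": verdict["action"],            # KEEP, COMPRESS, or DELETE
            "confidence": verdict["confidence"],
            "reasoning": verdict["reasoning"],
            "storage_saving": saving,
            "agent_cost": api_cost * share,         # cost split by share of the savings
            "net_saving": saving - api_cost * share,
        })
    return results
```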

Finally, verdicts get written to a database with a confidence score and reasoning. Destructive actions (COMPRESS or DELETE) don’t execute automatically. They wait for a human to confirm them in the dashboard.
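
A minimal sketch of that rule, with the table, column names, and sqlite-style placeholders invented for illustration:

```python
def persist_verdict(db, result: dict) -> None:
    """Write a verdict row; only KEEP finalizes on its own, the rest wait for review."""
    status = "confirmed" if result["action"] == "KEEP" else "pending_review"
    db.execute(
        "INSERT INTO verdicts (conversation_id, action, confidence, reasoning, status) "
        "VALUES (?, ?, ?, ?, ?)",
        (result["conversation_id"], result["action"],
         result["confidence"], result["reasoning"], status),
    )
```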

What it can actually do today

Run the agent against a database of conversations and you get back, for each one:

  • A verdict (KEEP, COMPRESS, or DELETE) with a confidence percentage
  • The reasoning Claude used to arrive at that verdict
  • The exact storage saving the action would produce
  • The agent’s own cost for that decision
  • A net saving figure, projected across the next 3, 6, and 12 months

Pending verdicts show up in a dashboard. A human reviews them, confirms or rejects, and that’s the loop. The agent recommends. The human decides.

For runs that don’t justify their cost, the agent stands down cleanly and explains why. The same dashboard shows historical run data — how many conversations got processed, how much was saved, how much was spent — so the system’s track record is auditable over time.

Everything is logged to an append-only audit table that no application code can update or delete. If something goes wrong, or if anyone questions a decision, the full history is recoverable.
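
One way to enforce that append-only property at the database layer, assuming Postgres and a dedicated application role (the role and table names are placeholders):

```python
AUDIT_HARDENING = [
    "REVOKE UPDATE, DELETE, TRUNCATE ON audit_log FROM app_role",
    "GRANT INSERT, SELECT ON audit_log TO app_role",
]

def harden_audit_table(conn) -> None:
    """Strip rewrite rights on the audit table; inserts and reads stay allowed."""
    with conn.cursor() as cur:          # psycopg-style connection assumed
        for statement in AUDIT_HARDENING:
            cur.execute(statement)
    conn.commit()
```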

What’s still missing

This is where I want to be honest, because it’s the difference between describing a working prototype and overselling a product.

Even after I had the system running end-to-end, I found a bug in the cost estimator that was causing the agent to refuse to run on exactly the conversations it should help with. The fix was small, but the lesson was real: when economic reasoning is happening behind the scenes, getting any of the inputs wrong silently breaks the whole thing.

Beyond that, there are several places where the system is clearly a prototype rather than a production product:

Clustering is bucket-based, not similarity-based. Conversations get grouped by discrete size and age buckets. This works fine when content patterns are roughly uniform, but real-world conversation diversity would expose its limits. A proper embedding-based similarity score would be more robust.
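
The direction I'd take it, sketched with sentence-transformers and scikit-learn; the model choice and cluster count are arbitrary:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def embed_and_cluster(summaries: list[str], k: int = 20):
    """Cluster conversations by what they say, not just how big and old they are."""
    model = SentenceTransformer("all-MiniLM-L6-v2")   # small, cheap embedding model
    embeddings = model.encode(summaries)
    return KMeans(n_clusters=k, n_init="auto").fit_predict(embeddings)
```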

Cost estimation uses a fixed constant. The agent assumes each API call costs roughly 700 tokens. That’s a reasonable approximation, calibrated against real calls, but a production system should be measuring this continuously and adjusting. I’d want to see the agent learn its own cost over time rather than rely on a hand-tuned number.
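
A sketch of what that learning could look like: seed the estimator with the hand-tuned constant, then update it from the token counts the Anthropic API actually reports on each response:

```python
class CostEstimator:
    """Running estimate of tokens per analysis call, seeded from the fixed constant."""

    def __init__(self, initial_tokens: float = 700.0, alpha: float = 0.1):
        self.avg_tokens = initial_tokens   # start from the hand-tuned assumption
        self.alpha = alpha                 # weight given to each new observation

    def observe(self, response) -> None:
        """Update from an API response, which reports actual input and output tokens."""
        used = response.usage.input_tokens + response.usage.output_tokens
        self.avg_tokens = (1 - self.alpha) * self.avg_tokens + self.alpha * used

    def estimated_call_cost(self, price_per_token: float) -> float:
        return self.avg_tokens * price_per_token
```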

The dashboard’s cumulative savings number is naive. It counts runs where the agent decided to keep everything, and runs where the agent stood down without acting, the same way it counts runs where compression actually happened. The result is a number that looks worse than reality.

The scheduler runs in-process. It’s a FastAPI background task today. For real scale, this needs to become a proper distributed task queue (Celery + Redis or similar), so the scheduler runs on a real cron schedule independent of the API server.
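
What that migration might look like, sketched with Celery; the broker URL and schedule are placeholders:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("lifecycle", broker="redis://localhost:6379/0")

@app.task(name="run_lifecycle_agent")
def run_lifecycle_agent():
    ...  # the existing run logic, invoked by a worker instead of a FastAPI background task

app.conf.beat_schedule = {
    "nightly-lifecycle-run": {
        "task": "run_lifecycle_agent",
        "schedule": crontab(hour=3, minute=0),   # 03:00 every night
    },
}
```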

There are no multi-tenant retention policies yet. The thresholds are global. Real organizations need per-team or per-user policies; the schema has the hook for this, but the logic doesn’t read from it yet.

These aren’t bugs. They’re the gap between “demonstrates the concept” and “handles real production workloads,” and that gap is wider than most prototypes acknowledge.

The bigger point

The interesting question, to me, isn’t whether this particular agent is the right shape. It’s whether the broader pattern — agents that reason about their own cost as a first-class concern — is going to become standard.

I think it has to. As LLMs get embedded into more workflows, the cost of using them will compound. Cost-blind automation is fine when each unit of work is essentially free, like a database query. It's not fine when each unit of work carries a measurable price tag and a value that's far harder to predict.

The right model for agents that operate at scale isn’t “run on everything and hope.” It’s something closer to a thoughtful contractor: looks at the job, estimates the work, decides whether the work is worth the fee, and only then proceeds. That self-awareness has to be built into the agent’s design — it’s not something you can bolt on later as a wrapper.

Conversation data is just the first wave. Whatever else we build agents to manage in the next few years — files, emails, knowledge bases, memory systems — the same pattern is going to apply. The agents worth running will be the ones that know when not to.

The full code is at github.com/SandeepPalugula/data-lifecycle-agent. The README has the architecture diagram, design decisions, and a couple of debugging stories I left out here.

A note on how this got built: I designed the system, drove the testing, and debugged the problems. The code itself was written collaboratively with Claude, Anthropic’s AI assistant. I think this is going to become an unremarkable way to build software soon, but right now it still feels worth naming.

