Your Claude Code is Starving, the Food’s Scattered All Over Your Org, and Some of it is Stale

How to build the context layer that spec-driven development assumes already exists, and how to keep it fresh

Spec-driven development is the right instinct. Write precise intent before you ask an agent to act on it. Define the acceptance criteria, the constraints, and the architectural boundaries. Give the agent a contract rather than a vague description and watch the output improve.

There is a problem with this, and it sits one layer below the spec.

Consider a well-written engineering ticket. Specific file paths. Exact property names. Explicit “do not create this” constraints. A clear definition of done. A ticket that any experienced developer on the team would read and immediately understand. Hand that ticket to an AI agent and ask it to implement the feature.

The agent implements something. It is fast, locally coherent, and syntactically correct. It also misses the implicit contract between the two services that was established three years ago and lives nowhere in writing. Or if it does exist in Confluence, it's probably stale by now. It uses a serialization pattern that is locally consistent but not the canonical one for this domain. It creates a model that should inherit from a base class that the ticket author forgot to mention because they assumed everyone already knew.

None of this is a spec quality problem. The spec was good. It is a context problem — specifically, context that exists nowhere an agent can read: not in the codebase, not in any ticket, not in any document. It lives in the heads of the three senior engineers who have been there long enough to know.

This is the layer that spec-driven development tools do not address. And it is the layer that determines whether AI-assisted development on a mature codebase produces fast, correct output or fast, subtly wrong output that requires a senior engineer to catch.

What chunking destroys

The standard response to this problem is retrieval-augmented generation: build a vector database, chunk your documents into 500-token pieces, embed them, and retrieve the most semantically similar chunks when a query arrives.

The chunking critique is right and worth saying plainly: cosine similarity on 500-token pieces is a genuinely bad way to find an answer that spans three pages. The answer to “why do we use this Redis TTL” might be in an ADR from 2021, an implementation note in a PR from 2022, and a comment in a Confluence page that nobody has looked at since. Chunk all three into 500-token fragments and hope the retrieval finds the right ones. It usually doesn’t. The context is three pages up, or in the document next to the one that was retrieved, or requires reading two sources together to understand.

Andrej Karpathy recently described a similar instinct for personal research: index raw documents into a folder, let an LLM compile a wiki, ask questions against the index. No embeddings, no chunking, no retrieval infrastructure. At a personal research scale with static documents, it works.

Satish Venkatakrishnan’s observation — build a librarian index instead, summarize each document to ~200 tokens, let the LLM read the index and fetch whole files — is a genuinely better architecture for bounded, relatively static document sets. The librarian reads the card catalog, picks the right books, and reads them in full. No chunking, no context destruction, no semantic search prayer.

The limitation surfaces at the enterprise scale, and it has two parts.

The first is the dynamic layer. The librarian index works when the card catalog reflects reality. In a large engineering organization, the knowledge that matters most changes constantly. A PR merge updates how a component works. A closed ticket establishes a new pattern. An architecture decision gets revised. Rebuilding the full index on every change is expensive. Not rebuilding it means the librarian is reading a stale catalog and confidently fetching the wrong books.

The second is the knowledge that has never been written down at all. No document set captures the implicit contracts between services that were established in a room where everyone was present. No index contains the reasoning behind a decision that was made verbally and never recorded. The most consequential institutional knowledge is also the most invisible — not because anyone chose to hide it, but because when you already know something, you stop thinking of it as knowledge worth writing down.

There is a third layer worth naming: the knowledge the KB itself discovers over time. Every query an agent makes against the KB is a signal — which components are queried most, which answers require pulling from three disparate sources, and which queries returned low confidence. That signal feeds back into the KB as a prioritization guide. The components queried most often with the lowest confidence are the ones with the thinnest documentation. The queries that required stitching together three sources suggest the need for a missing synthesis document.
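As a sketch, this feedback loop might aggregate the query log like the following. The log format, component names, and ranking heuristic are illustrative assumptions, not a prescribed API:

```python
# Hypothetical sketch: rank components by query volume and low average
# confidence, assuming each KB query is logged as (component, confidence).
from collections import defaultdict

def prioritize(query_log):
    """The most-queried, least-confident components are where the
    documentation is thinnest and KB investment pays off first."""
    stats = defaultdict(lambda: {"n": 0, "conf": 0.0})
    for component, confidence in query_log:
        stats[component]["n"] += 1
        stats[component]["conf"] += confidence
    ranked = sorted(
        stats.items(),
        # Sort key: (query count, 1 - average confidence), descending.
        key=lambda kv: (kv[1]["n"], 1 - kv[1]["conf"] / kv[1]["n"]),
        reverse=True,
    )
    return [name for name, _ in ranked]

log = [("billing", 0.3), ("billing", 0.4), ("auth", 0.9)]
print(prioritize(log))  # "billing" ranks first: queried often, low confidence
```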

The missing layer

What an AI agent actually needs before it starts implementing is not just the spec. It needs what a new senior engineer joining the team would spend six months acquiring: why is this system structured the way it is, which patterns are canonical versus accidental, what are the implicit contracts between components, what has been tried and failed, and what does “follows existing patterns” actually mean for this specific domain.

Call this the context layer. It is the difference between an agent that implements quickly and one that implements correctly.

The context layer does not exist as a single artifact in most engineering organizations. It is distributed across codebases, tickets, Confluence pages, PR history, review comments, and the heads of the engineers who have been there the longest. Some of it is in ADRs that haven’t been updated since the decision was made. Some of it is in PR comments that nobody indexed. Most of it is in the response a senior engineer gives when a junior asks, “Why does this work this way?”

Extracting, structuring, and making this layer queryable — in a form that agents can use before they start, not just humans browsing after the fact — is the actual infrastructure problem that enterprise-scale AI-assisted development requires.

What the architecture looks like

The right architecture for this problem has four components.

Raw sources. The knowledge already exists as a byproduct of normal engineering work: merged PRs, closed Jira tickets, Confluence pages, code across repos, automated review output, and integration tests. The challenge is not creating new knowledge — it is extracting and structuring what is already being produced.

Event-driven extraction agents. Each source layer has a dedicated extraction agent that runs on a trigger: a PR merge, a ticket closure, a Confluence page edit on a tagged page. The agent reads the raw source, extracts structured knowledge according to a predefined schema, and writes it to the knowledge base as a PR rather than a direct commit. The PR-not-direct-commit pattern matters — it creates a review surface, maintains a clean git history, and allows humans to audit what the agents are writing before it becomes authoritative.

The extraction prompt for each agent is not a generic summarization instruction. It is schema-specific: extract the architectural decision and its rationale, extract the canonical pattern and its exceptions, and extract what was rejected and why. The agent rates its own extraction confidence, which determines the auto-merge-versus-human-review threshold.
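A minimal sketch of the confidence-gated routing step. The schema fields and the 0.85 threshold are assumptions for illustration; the real schema and threshold would be organization-specific:

```python
# Hypothetical extraction-agent output and routing. Field names and the
# auto-merge threshold are illustrative assumptions.
from dataclasses import dataclass

AUTO_MERGE_THRESHOLD = 0.85  # assumed value; tuned per organization

@dataclass
class Extraction:
    component: str    # which KB entry this knowledge belongs to
    kind: str         # "decision" | "pattern" | "rejected_approach"
    summary: str      # structured summary per the extraction schema
    rationale: str    # the "why" being captured, not just the "what"
    confidence: float # agent's self-rated extraction confidence

def route(extraction: Extraction) -> str:
    """Decide whether an extraction PR auto-merges or waits for review."""
    if extraction.confidence >= AUTO_MERGE_THRESHOLD:
        return "auto-merge"
    return "human-review"
```

Either way the write lands as a PR against the KB repository; the threshold only decides whether a human must approve it before it becomes authoritative.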

A compiled wiki. All extracted content is stored in a Git-backed repository of structured Markdown. Git backing provides version history, diff-based review, and familiar tooling. The directory structure is organized by domain and component, not by source type. An entry for a component contains its overview, conventions, dependencies, the decisions that shaped it, canonical examples, and quality signals — all from different source types, compiled into a single queryable entry.
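A hypothetical shape for one compiled entry, assuming one markdown file per component with one `##` heading per section; the field names are illustrative:

```python
# Illustrative sketch: a single component entry compiled from multiple
# source types into one queryable markdown file.
from dataclasses import dataclass

@dataclass
class ComponentEntry:
    name: str
    overview: str = ""
    conventions: str = ""
    decisions: str = ""       # ADR summaries with links
    examples: str = ""        # canonical usage examples
    quality_signals: str = "" # review-flag stats, staleness notes

    def to_markdown(self) -> str:
        """Render the entry; each `##` heading is one retrieval chunk."""
        sections = [
            ("Overview", self.overview),
            ("Conventions", self.conventions),
            ("Decisions", self.decisions),
            ("Canonical examples", self.examples),
            ("Quality signals", self.quality_signals),
        ]
        lines = [f"# {self.name}"]
        for title, body in sections:
            lines.append(f"\n## {title}\n")
            lines.append(body or "_not yet extracted_")
        return "\n".join(lines)
```

The fixed section order is the point: a consistent schema is what later makes section boundaries reliable chunk boundaries.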

The wiki is not documentation written for human readers, nor is it documentation maintained by humans. Karpathy’s framing is right: the KB is the LLM’s domain. Humans author in their natural tools — writing PRs, closing tickets, and editing Confluence pages. Extraction agents compile the KB from that output. Humans audit what agents write, resolve flagged conflicts, and author ADRs when a human decision is needed. But the KB's primary authors are the agents themselves, and its primary readers are agents as well. The human role is auditor, not author. This is what breaks the arc of every previous documentation effort: the inputs are the work, not a separate activity that gets deprioritized when the sprint starts.

A retrieval layer with two modes. The wiki is queried through an MCP server that exposes a small set of tools to agents: get a component overview, get the ADRs for a domain, search the KB. The retrieval layer supports both navigational queries — the agent knows what it is looking for and needs to fetch it precisely — and conceptual queries, where the agent needs semantic search to find something it cannot name. Navigational queries use keyword retrieval. Conceptual queries use vector search over embedded wiki chunks. The key design decision is chunking by semantic section rather than by token count — each `##` section in a component file becomes a single chunk, which works because the consistent KB schema makes section boundaries reliable semantic boundaries.

The MCP server returns structured markdown with source links, a staleness flag, and a confidence field. The calling agent decides how much context to include in its window based on relevance scores. The retrieval layer returns a ranked list; the agent consumes as much as is useful.
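A sketch of the section-level chunking and the navigational mode under those rules. The keyword scorer here is a deliberately naive term-overlap stand-in for illustration; a real navigational path would use a proper keyword index:

```python
# Illustrative sketch: chunk KB entries at `##` headings and rank chunks
# by query-term overlap. The scoring is a stand-in, not the real retriever.
import re

def chunk_by_section(markdown: str) -> list[str]:
    """Split a KB entry into chunks at `##` headings, not token counts."""
    parts = re.split(r"(?m)^(?=## )", markdown)
    return [p.strip() for p in parts if p.strip()]

def keyword_search(chunks: list[str], query: str) -> list[str]:
    """Navigational mode stand-in: rank chunks by shared query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.lower().split())), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked if score > 0]
```

The conceptual mode would embed the same section chunks and run vector search over them; only the scoring differs, not the chunking.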

What to extract and from where

The highest-value sources, in order of signal density:

PR history is underrated and largely ignored. Merged PRs are documented decisions. They show what was proposed, what the review pushed back on, what changed before the merge, and — critically — which approaches were rejected and why. An extraction agent that reads PR descriptions and review comments and produces structured “this approach was rejected because X” entries is capturing tacit knowledge that lives nowhere else. The PR corpus of the last 90 days is a better source of current architectural conventions than any Confluence page.

Architecture decision records are the highest-value, lowest-volume content type. Most organizations have some ADRs, inconsistently formatted and scattered across Confluence. Normalizing them into a consistent schema — status, decision, context, alternatives considered, consequences — and importing them into the KB as first-class entries requires one-time effort and pays indefinitely. An agent that can retrieve the ADR for a decision understands not just what the pattern is but why it exists and what would break if it changed.
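One possible normalized shape, following the status, decision, context, alternatives, and consequences fields named above; the helper function is an illustrative assumption:

```python
# Illustrative normalized ADR schema for first-class KB entries.
from dataclasses import dataclass

@dataclass
class ADR:
    status: str              # "accepted" | "superseded" | "rejected"
    decision: str            # what was decided
    context: str             # why the decision was needed
    alternatives: list[str]  # what was considered and rejected, with reasons
    consequences: str        # what would break if the decision changed

def is_authoritative(adr: ADR) -> bool:
    """Only accepted ADRs should answer 'why' queries as authoritative."""
    return adr.status == "accepted"
```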

Closed Jira tickets with well-written acceptance criteria reveal recurring patterns. Across 50 tickets in the same domain, the implementation decisions that recur are the domain’s actual conventions, whether or not anyone has written them down explicitly. An extraction agent that reads the closed-ticket corpus and identifies recurring patterns is doing the work that nobody had time to do when those patterns were being established.

Automated review output accumulated over time tells you which components generate the most review flags, which classes of issues recur, and which teams have the highest flag rates. This is a quality signal, not a context signal, but it identifies where the context layer is thinnest — the components with the highest review flags are the ones where the KB most needs to be built.

Code sweep across repos provides the structural layer: what exists, where it lives, which team owns it based on PR history, what the dependency graph looks like, and what naming conventions have emerged per domain. The initial sweep is one-time and expensive. Incremental maintenance on PR merge is cheap — the extraction agent reads the diff, identifies which components were touched, and updates the relevant KB entries.

What not to extract: Slack history in bulk (too noisy, too ephemeral), all of Confluence (most of it is meeting notes and status updates), in-progress tickets (too unstable). The KB should be curated, not a full mirror.

The operational reality

The system is designed to minimize ongoing human maintenance, but it does not eliminate it. A realistic operational model has three ongoing activities.

Reviewing extraction PRs. Once a week, thirty minutes. The extraction agents write to the KB as PRs. A rotating responsibility on the platform team reviews them — not line by line, but for gross errors and confidence-flagged entries. As auto-merge thresholds stabilize, this cost drops.

Resolving conflicts. When two sources contradict each other — a PR description implies one convention, a recent ticket implies another — the extraction agent writes both to the KB, flags the conflict, and tags the entry as contested. A human writes an ADR that becomes the authoritative source. This happens occasionally, not continuously.

Benchmarking retrieval quality. Quarterly. The quality of keyword retrieval is easy to verify. Vector retrieval quality drifts as the KB grows and the embedding model’s behavior relative to real agent queries shifts. A quarterly benchmark against a representative set of saved agent queries — did the right chunks come back, did the ranking make sense — catches retrieval degradation before it affects agent output quality.

The system also needs a staleness monitor. Every KB entry carries a last_updated timestamp and a source_updated timestamp. When a source changes and the KB has not been updated, the entry is flagged as potentially stale, and the agent querying it receives a staleness signal along with the content. A nightly reconciliation job compares the chunk count in the vector index against the section count in the git-backed markdown and alerts on mismatches. The most common silent failure mode is a KB write that succeeds while the downstream index update fails.
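The timestamp check and the nightly reconciliation can be sketched as follows; the function names and the per-file chunk count from the index are assumptions:

```python
# Illustrative staleness and reconciliation checks for the nightly job.
import re
from datetime import datetime, timezone

def is_stale(last_updated: datetime, source_updated: datetime) -> bool:
    """Flag entries whose source changed after the last KB update."""
    return source_updated > last_updated

def count_sections(markdown: str) -> int:
    """Sections = `##` headings, matching the chunking rule."""
    return len(re.findall(r"(?m)^## ", markdown))

def reconcile(markdown: str, indexed_chunks: int) -> bool:
    """True when the vector index matches the KB file; False means alert:
    the KB write succeeded but the index update silently failed."""
    return count_sections(markdown) == indexed_chunks
```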

A scheduled linting agent that reads the KB and proactively flags inconsistencies, infers missing connections, and identifies gaps before they cause a bad agent output is not an optional feature — it is what separates a KB that improves over time from one that drifts.

The hard part is not the architecture

The architecture described here is not complicated. The components are well-understood: git-backed markdown, event-driven extraction agents, MCP server, and hybrid retrieval. None of this requires novel technology.

The hard part is the organizational decision to treat the context layer as infrastructure to build and maintain, rather than as a documentation project that will be deprioritized as soon as the next sprint starts.

Most documentation efforts in engineering organizations follow the same arc: someone decides the team needs better docs, a documentation sprint produces a set of pages that are accurate at the moment of writing, the pages drift as the codebase evolves, nobody maintains them because maintaining documentation is never in the sprint, and eighteen months later the pages are more misleading than no documentation at all.

The KB architecture described here is specifically designed to break that arc. The inputs are the work that teams are already doing — writing PRs, closing tickets, and editing Confluence pages. The extraction agents derive the KB from that work rather than requiring a separate documentation activity. Staleness is monitored rather than assumed. The KB stays current because its inputs are up to date, not because someone has time to maintain it.

The prerequisite is instrumenting the pipeline well enough to know whether the KB is actually working — whether agents using it produce better output, whether retrieval finds the right context, and whether the right stages of the delivery pipeline are improving. That measurement discipline is the same discipline required to find the binding constraint in the first place. The KB is one intervention in a systematic pipeline redesign. Whether it is the right intervention to make first depends on whether context transfer is actually the binding constraint in your specific pipeline.

If it is, this is what fixing it looks like. But fixing the context layer is not the end of the story.

What comes after context retrieval

The architecture described here is a context-window solution. Knowledge is extracted, structured, and retrieved into an agent’s context at query time. The agent knows what it needs because the KB told it.

The end state is different. Karpathy points at it: synthetic data generation and finetuning, so the model knows your organization’s knowledge in its weights rather than retrieving it from context. A model fine-tuned on your KB doesn’t need to query the Yodlee request pattern or the Redis TTL rationale — it already knows, the way a senior engineer who has been there for five years already knows. The retrieval step disappears. The latency drops. The context window is freed for the task, not the background.

That is not today’s architecture. Finetuning on a living KB requires the KB to be stable enough to train on, the training loop to be fast enough to keep pace with changes, and the evaluation infrastructure to know when the finetuned model has regressed. None of that is trivial. But it is the direction, and building the KB now is the prerequisite. You cannot finetune on knowledge that was never extracted and structured in the first place.

Karpathy called it “room for an incredible new product instead of a hacky collection of scripts.” The KB architecture described here is a hacky collection of scripts. The incredible product is what it becomes when the retrieval layer, the linting loop, and the finetuning pipeline are unified into something that learns continuously from the work your organization is already doing. That product does not exist yet. The infrastructure for it does.

The author works on AI engineering and platform architecture. This is the second in a series on the AI-native software development lifecycle. The first article — “It’s not Claude Code, silly — your SDLC is not AI-native” — covers the instrumentation-first approach to finding and systematically eliminating human bottlenecks in the delivery pipeline.

Sources: Satish Venkatakrishnan, “LLM Knowledge Bases” (LinkedIn, 2026); Andrej Karpathy, “LLM Knowledge Bases” (X, 2026); GitHub Spec Kit documentation; Augment Code Intent documentation.


Your Claude Code is Starving, the Food’s Scattered All Over Your Org, and Some of it is Stale was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
