LAI #123: Claude Code’s Codebase Was Accidentally Leaked

What we found inside, plus few-shot ordering tricks, DPO vs. GRPO, agentic RAG on Azure, and more!

Good morning, AI enthusiasts!

Claude Code’s entire codebase was accidentally leaked a couple of weeks ago, and I went through it. This week, I share what I found about how it handles memory, compacts conversations, runs background agents, and more.

We also cover:

  • Why flat tool registries cause agent failures and how an ontology layer that gates access based on state fixes it.
  • Where SFT hits a ceiling and when to reach for DPO or GRPO: two post-training methods that solve different problems at very different compute costs.
  • A full walkthrough of agentic RAG on Azure, from query rewriting and hybrid search to a self-correction loop that re-searches when quality falls short.
  • How Claude Code Skills work under the hood and the false-trigger problem that breaks every skill library past ten entries.
  • Claude Code’s subagent delegation model: four agent types, isolated context windows, and a hub-and-spoke pattern that keeps debugging tractable.

Let’s get into it!

What’s AI Weekly

A couple of weeks ago, Claude Code’s entire codebase was accidentally leaked, giving us a sneak peek at what is behind this system. This week in What’s AI, I will share what I found after digging into it, including the memory system, conversation compaction, background agents, architecture, strategies for optimizing performance, and more. Watch the complete update on YouTube.

AI Tip of the Day

Few-shot examples are not interchangeable. LLMs tend to show recency bias, meaning the last example in your prompt often has a disproportionate influence on the output. If you place your hardest edge case last, you risk biasing every response toward that edge case. Put your strongest, cleanest example last instead. Edge cases belong in the middle, not at the end.

This applies to any task where examples shape the format, tone, or structure of the output.

In our testing, reordering the same set of examples improved output consistency more than adding additional examples did. Before you invest in fine-tuning, try systematically reordering and evaluating your few-shot examples. In many cases, the prompt you already have is good enough; it’s just structured wrong.
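As a minimal sketch of the idea, here is one way to reorder a set of few-shot examples so edge cases sit in the middle and the cleanest example lands last. The example data, the `kind` labels, and the helper names are illustrative assumptions; in practice you would rank examples by evaluation results on your own task.

```python
# Sketch: exploit recency bias by ending the prompt with the strongest example.

def order_examples(examples):
    """Open strong, bury edge cases mid-prompt, end with the cleanest example."""
    strong = [e for e in examples if e["kind"] == "strong"]
    edge_cases = [e for e in examples if e["kind"] == "edge"]
    return strong[:1] + edge_cases + strong[1:]

def build_prompt(task, examples):
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}" for e in examples
    )
    return f"{task}\n\n{shots}\n\nInput:"

examples = [
    {"kind": "edge", "input": "N/A", "output": "unknown"},
    {"kind": "strong", "input": "2 + 2", "output": "4"},
    {"kind": "strong", "input": "3 * 3", "output": "9"},
]

ordered = order_examples(examples)
prompt = build_prompt("Answer the question.", ordered)
```

Then evaluate each ordering against a held-out set before deciding the prompt needs fine-tuning at all.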

If you’re building prompting or RAG pipelines and want to go deeper into prompt evaluation and iteration, this is one of the techniques we cover hands-on in our Full Stack AI Engineering course.

— Louis-François Bouchard, Towards AI Co-founder & Head of Community

Learn AI Together Community Section!

Featured Community post from the Discord

Yks1309 just shipped financial-hub-mcp, an open-source TypeScript MCP server for financial data aggregation. It connects any MCP-compatible AI assistant to SEC EDGAR filings, XBRL financial statements, FRED economic indicators, and real-time market data. It has built-in XBRL normalization, fact deduplication, computed analytics, stock screening, and rate-limit protection. Check it out on GitHub and support a fellow community member. If you have any questions or suggestions for improving it, share them in the thread!

AI poll of the week!

Gemma 4 comes out on top, with Qwen 3.5 right behind, and then a long tail (GLM, MiniMax, DeepSeek, GPT-oss, Qwen Coder, plus “Other”). This shows people are optimizing for different constraints. Qwen’s share signals that pure capability and momentum from Chinese labs still matter a lot, but Gemma 4 winning suggests something else is rising in importance: models that are easy to run cleanly, feel straightforward to deploy, and come with licensing/origin stories that don’t complicate compliance conversations.

When you recommend an open model, what are you really optimizing for first: cost to serve, ease of self-hosting, license/compliance comfort, or raw quality on your tasks? And what kind of project are you building that makes that tradeoff worth it? Let’s talk in the thread!

Collaboration Opportunities

The Learn AI Together Discord community is overflowing with collaboration opportunities. If you are excited to dive into applied AI, want a study partner, or even want to find a partner for your passion project, join the collaboration channel! Keep an eye on this section, too — we share cool opportunities every week!

1. Lingyu_70906 is interested in learning DeepRL and wants a study partner to share ideas. Connect with him in the thread if you want to study together.

2. Xinerent is building an AI automation and productivity platform and is looking for AI engineers who want to build privacy-first AI tools. If this sounds interesting, reach out to them in the thread!

3. Lyraluthuin is working on an AI system to create a 3D model from human-drawn lines and is looking to connect with people in computer graphics animation technology to help scope out the remaining work. If this is your space, contact her in the thread!

4. Augmnt_sh is building an Agent Observability Protocol and is looking for a few people who are building autonomous agents and want to try AOP on their stack, interested in contributing to the SDK, and into the observability/dev tooling space. If this sounds like you, connect with them in the thread!

Meme of the week!

Meme shared by efficientnet_99825

TAI Curated Section

Article of the week

Designing Ontology-Aware Tooling for Agents By Suraj Pandey

Flat tool registries lead to agent failures because they lack semantic context about what each tool is for. This article shows you how to build a structured solution: an ontology layer that classifies tools by category, scope, and confirmation level, and then dynamically gates agent access based on state. The stack includes an OntologyAwareRouter that hides destructive tools until preconditions are met, a ToolPlanValidator that catches conflicting calls before execution, and a LangGraph confirmation gate that structurally enforces human sign-off.
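To make the gating idea concrete, here is a hedged sketch of state-based tool visibility, loosely modeled on the router pattern the article describes. The tool names, metadata fields, and state flag are illustrative assumptions, not the article's actual API.

```python
# Minimal sketch: hide destructive tools until a precondition is met.

TOOLS = {
    "read_file":   {"category": "read",        "confirmation": "none"},
    "write_file":  {"category": "write",       "confirmation": "user"},
    "delete_repo": {"category": "destructive", "confirmation": "explicit"},
}

def visible_tools(state):
    """Expose destructive tools only after a human has confirmed intent."""
    allowed = []
    for name, meta in TOOLS.items():
        if meta["category"] == "destructive" and not state.get("user_confirmed"):
            continue  # precondition not met: the agent never sees this tool
        allowed.append(name)
    return allowed
```

Because the destructive tool is absent from the registry the agent sees, the guardrail is structural rather than prompt-based: the model cannot call what it was never offered.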

Our must-read articles

1. Post-SFT Alignment with DPO and GRPO: How to Fine-Tune Correctly By Suchitra Malimbada

Supervised Fine-Tuning has a ceiling: cross-entropy loss teaches token imitation rather than preference, leaving models repetitive and brittle in the face of ambiguity. This piece covers two methods that fix it: DPO and GRPO. DPO uses preference pairs with a frozen reference policy to rank outputs rather than mimic them, making it the standard choice for subjective tasks. GRPO eliminates the critic model with group-scored rollouts and deterministic rewards, cutting PPO’s memory overhead roughly in half and outperforming DPO on tasks with verifiable ground truth.
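To illustrate the "rank rather than mimic" point, here is a toy computation of the standard DPO loss on a single preference pair. The log-probability values are made-up numbers, and this is a didactic sketch rather than anything from the article's training code.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a response under the
    trainable policy (pi_*) or the frozen reference policy (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Logistic loss: small when the policy prefers the chosen response
    # more strongly than the frozen reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The frozen reference terms are what keep the policy anchored: the loss rewards widening the chosen-over-rejected gap relative to the reference, not just raising the chosen response's probability.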

2. Implementing Agentic RAG on Azure: From Hand-Coded Code to Ready-to-Use Solutions By Chris Bao

Agentic RAG goes beyond linear retrieval by embedding planning, reflection, and tool-calling into the RAG pipeline. This guide walks you through building a working system on Azure AI Search and Foundry Agent Service, covering query rewriting, hybrid search, LLM-as-a-judge evaluation using groundedness and relevance scores, and an iterative self-correction loop that triggers re-searches or Bing web searches when answer quality falls short. The piece also contrasts this hand-coded complexity with Azure AI Search’s native agentic retrieval mode, which automatically handles query decomposition and multi-source merging.
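The self-correction loop can be sketched as follows. The retriever, generator, judge, and web-search fallback here are stand-in callables, not the actual Azure AI Search or Foundry Agent Service interfaces.

```python
# Hedged sketch: retrieve, judge, and re-search when answer quality falls short.

def answer_with_self_correction(query, retrieve, generate, judge, rewrite,
                                web_search, threshold=0.7, max_rounds=3):
    """Iterate retrieval until an LLM-as-a-judge score clears the threshold."""
    for _ in range(max_rounds):
        docs = retrieve(query)
        answer = generate(query, docs)
        if judge(answer, docs) >= threshold:  # groundedness/relevance gate
            return answer
        query = rewrite(query)                # reformulate and re-search
    return generate(query, web_search(query))  # final fallback: web results

# Toy stubs to exercise the loop: the first retrieval misses, the rewritten
# query succeeds on the second round.
attempts = []

def fake_retrieve(q):
    attempts.append(q)
    return ["doc"] if "rewritten" in q else []

def fake_generate(q, docs):
    return f"answer from {len(docs)} docs"

def fake_judge(answer, docs):
    return 1.0 if docs else 0.0

def fake_rewrite(q):
    return q + " rewritten"

def fake_web(q):
    return ["web hit"]

result = answer_with_self_correction(
    "q", fake_retrieve, fake_generate, fake_judge, fake_rewrite, fake_web
)
```

The bounded `max_rounds` matters: without it, a judge that never clears the threshold would loop forever instead of degrading gracefully to the web fallback.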

3. Claude Code: How to Build, Evaluate, and Tune AI Agent Skills By Rick Hightower

Claude Code Skills are SKILL.md files that extend Claude’s behavior for specific workflows, and this article breaks down how to build, evaluate, and maintain them effectively. It distinguishes two types: Capability Uplift skills, which teach better reasoning but age out as models improve, and Encoded Preference skills, which capture irreplaceable team workflows and compound in value over time. The Skill Creator handles creation, benchmarking, and trigger tuning, solving the false-trigger problem that plagues libraries with ten or more overlapping skills.

4. Claude Code Subagents and Main-Agent Coordination: A Complete Guide to AI Agent Delegation Patterns By Rick Hightower

Claude Code delegates work by spawning specialized subagents, each running in its own isolated context window with restricted tool access and a custom system prompt. This article covers four built-in subagent types: Explore, Plan, General-purpose, and Bash, along with a five-step delegation flow where only final summaries return to the main agent. The hub-and-spoke coordination model keeps debugging tractable, JSONL transcripts provide full auditability, and custom agents defined via YAML frontmatter let teams encode specialized workflows into reusable, shareable definitions.
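For reference, a custom agent definition is a Markdown file whose YAML frontmatter declares the agent's name, description, and tool access, per Claude Code's documented format. The field values below are illustrative, not taken from the article.

```markdown
---
name: code-reviewer
description: Reviews diffs for style and security issues. Use after code changes.
tools: Read, Grep, Glob
---

You are a careful code reviewer. Examine the provided changes and report
style violations and potential security issues, citing file and line.
```

The body below the frontmatter becomes the subagent's system prompt, and the `tools` list restricts what it can touch, which is how teams encode the restricted-access delegation pattern described above.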

If you are interested in publishing with Towards AI, check our guidelines and sign up. We will publish your work to our network if it meets our editorial policies and standards.


LAI #123: Claude Code’s Codebase Was Accidentally Leaked was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
