The Context Rut, plus vectorless RAG, why attention is kernel evaluation, and the end of XGBoost’s decade
Good morning, AI enthusiasts!
An AI agent just ran 700 experiments on its own, found patterns, and optimized its own performance with no human in the loop. This week, I break down Karpathy’s Auto Research project and the technical bottleneck it exposes: the Context Rut. We also cover why putting business logic inside a prompt is one of the most common mistakes in production LLM systems, and what to do instead.
We also cover:
- How to deploy a full Snowflake Cortex AI dashboard from a single SQL worksheet.
- A 70-year-old theorem that proves every attention score you’ve ever computed is a kernel evaluation, and why softmax is a mathematical necessity.
- RAG without a single vector: PageIndex replaces embeddings with a reasoning-driven tree index and scored 98.7% on FinanceBench.
- The three math ideas that make or break your understanding of backpropagation: derivatives, chain rule, and log-loss.
- Why XGBoost’s real bottleneck was never the model but the 70–80% of project time spent flattening relational databases into matrices. Relational Foundation Models skip that layer entirely.
Let’s get into it!
What’s AI Weekly
This week in What’s AI, I break down the architecture behind this “Auto Research” agent and the biggest technical hurdle in building agents: the Context Rut. Andrej Karpathy released a project that ran 700 experiments entirely on its own, identifying patterns and optimizing its own performance without human intervention. That raises an important question: are we seeing the first real-world loop of AI making AI better? I answer this and share a few more technical strategies to keep your agentic frameworks lean and efficient. Watch the full video on YouTube.
AI Tip of the Day
If your business logic only exists inside a prompt, you cannot test it, audit it, or guarantee it runs the same way twice. Prompts like “only approve refunds under $50” look like rules, but they are suggestions the model can misinterpret, ignore under edge cases, or lose entirely to a prompt injection.
A better approach is to keep product rules in normal code. The model should extract intent, classify inputs, and generate responses. Your backend should enforce limits, check eligibility, validate account state, and gate irreversible actions.
For example, the model can extract a refund reason and suggest whether the user is eligible for a refund. But the backend should check the actual purchase history, policy rules, and account state before anything happens.
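The split can be sketched in a few lines of Python. Everything here is hypothetical (the function names, the `ORDERS` table, the `$50` limit), and the LLM call is stubbed as a plain dict to keep the example self-contained: the model only extracts intent, while a deterministic backend makes the actual decision.

```python
from dataclasses import dataclass

# Hypothetical structured output from the model: it extracts intent and
# fields, but decides nothing. In production this would come from a
# function-calling / structured-output API.
llm_extraction = {"intent": "refund", "order_id": "A-1001", "reason": "damaged item"}

@dataclass
class Order:
    order_id: str
    amount: float
    refundable: bool

# Deterministic backend state and rules: testable, auditable, and immune
# to prompt injection, because the model never sees or enforces them.
ORDERS = {
    "A-1001": Order("A-1001", 34.99, True),
    "A-1002": Order("A-1002", 120.00, True),
}
REFUND_LIMIT = 50.00

def approve_refund(extraction: dict) -> tuple[bool, str]:
    """Backend gate for an irreversible action. Runs the same way every time."""
    order = ORDERS.get(extraction.get("order_id"))
    if order is None:
        return False, "unknown order"
    if not order.refundable:
        return False, "order not refundable"
    if order.amount > REFUND_LIMIT:
        return False, "amount exceeds auto-refund limit"
    return True, "approved"
```

Because `approve_refund` is plain code, you can unit-test the $50 rule directly instead of hoping the model honors a sentence in a prompt.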
We cover this pattern and the broader architecture decisions behind production LLM systems in our Full Stack AI Engineering course.
— Louis-François Bouchard, Towards AI Co-founder & Head of Community
Learn AI Together Community Section!
Featured Community post from the Discord
Colignum created LACK, a lightweight, self‑hosted multi‑agent chat platform powered by local LLMs (Ollama). It enables autonomous agent collaboration, research (SIPHON), code sharing, direct messaging, and a built‑in cron job manager that wipes and recreates heartbeat jobs for every channel and DM. Check it out on GitHub and support a fellow community member. If you have feedback, share it in the channel.
AI poll of the week!

This looks less like a “which model wins” poll and more like a tool stack snapshot. Opus leads, but the comments make it clear most people aren’t committing to one model; they’re routing.
Claude (especially Opus) shows up consistently for coding and structured work, GPT for general use and brainstorming, and Gemini for a second opinion or for tasks like search, explanation, and media. Even the tradeoffs are consistent: Claude is strong but token-heavy, GPT is reliable for everyday use, and Gemini is surprisingly good when you need breadth or external context. The interesting part isn’t who’s best, it’s that people are building workflows across models instead of betting on one.
If you had to remove one model from your stack today, which one would actually break your workflow the most, and what specific task would you struggle to replace? Let’s talk in the thread!
Collaboration Opportunities
The Learn AI Together Discord community is full of collaboration opportunities. If you are excited to dive into applied AI, want a study partner, or even want to find a partner for your passion project, join the collaboration channel! Keep an eye on this section, too — we share cool opportunities every week!
1. Knightinout is looking for a collaborator to discuss generative AI, AI agents, and Machine Learning, including mathematical foundations, over the summer. The goal is to quickly understand the landscape and compound efforts to get certifications faster. If this sounds like a good way to spend the summer, connect with them in the thread!
2. Amanray9414 is exploring the intersection of agentic AI and reinforcement learning and wants to go deeper into autonomous agents. If that sounds interesting, reach out to them in the thread!
3. Beratgurleer is experimenting with no-code automations and AI-based projects and is looking for a business partner. The agency will be based on n8n, and you’ll learn, brainstorm, develop products together, and solve problems. If this sounds fun, contact them in the thread!
Meme of the week!

Meme shared by drdub_
TAI Curated Section
Article of the week
Building the Agentic Enterprise Control Plane on Snowflake By Satish Kumar
This article walks through the process of building a production-grade Snowflake Cortex AI dashboard, deployed entirely from a single SQL worksheet. The setup covers five enterprise tables, 70 synthetic records, and six Cortex functions, including sentiment analysis, summarization, classification, and LLM inference via COMPLETE. It also uses Python stored procedures with chr() substitution to avoid SQL parser conflicts, writing the full Streamlit app line by line directly to stage, with no external files, no manual steps, and built-in version control through an APP_SOURCE table.
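The chr() substitution trick is easy to miss if you have never embedded Python source inside a SQL string: a literal single quote in the Python body would terminate the surrounding SQL string literal. Here is a minimal, hypothetical sketch of the idea (not the article’s actual procedure), where quotes and newlines are emitted as `chr()` calls so the generated source contains no characters the SQL parser would swallow:

```python
# When a Python stored procedure is defined inline in a SQL worksheet, a
# literal ' in the Python source would end the SQL string early. Building
# quotes via chr(39) keeps the source free of raw single quotes.
Q = chr(39)   # single quote: '
NL = chr(10)  # newline

# Assemble one line of the generated Streamlit app without typing a quote.
line = "st.title(" + Q + "Cortex AI Dashboard" + Q + ")"

# Join generated lines into the app source to be written to stage.
app_source = NL.join(["import streamlit as st", line])
```

The same assembled string can then be written line by line to a stage (and versioned in a table such as the article’s APP_SOURCE), with no external files involved.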
Our must-read articles
1. Every Attention Score You Have Ever Computed Is a Kernel Evaluation By Dr. Swarnendu AI
A 70-year-old theorem, Mercer’s theorem, proves that attention scores in transformers are kernel evaluations. The piece explains the mathematics connecting dot-product attention to Reproducing Kernel Hilbert Spaces, showing that softmax is not an architectural choice but a mathematical necessity: the only normalization that simultaneously satisfies non-negativity, unit-sum, differentiability, and score amplification. It also unifies SVMs, Gaussian Processes, and transformers into a single equation, reframing attention as Nadaraya-Watson kernel regression in a learned representation space.
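The Nadaraya-Watson framing is easy to verify numerically. The sketch below (random toy data, single query) computes standard scaled dot-product attention, then recomputes the same output as kernel regression with the kernel k(q, kᵢ) = exp(q·kᵢ/√d); the two are identical by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)        # one query vector
K = rng.normal(size=(5, d))   # 5 key vectors
V = rng.normal(size=(5, 3))   # 5 value vectors

# Standard scaled dot-product attention for a single query.
scores = K @ q / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum()
attn_out = weights @ V

# Nadaraya-Watson kernel regression with k(q, k_i) = exp(q.k_i / sqrt(d)):
# a kernel-weighted average of the values.
kernel = np.exp(K @ q / np.sqrt(d))
nw_out = (kernel[:, None] * V).sum(axis=0) / kernel.sum()
```

Softmax is exactly the normalization step of the kernel estimator, which is the sense in which the article calls it a mathematical necessity rather than a design choice.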
2. Vectorless RAG with PageIndex [+ Implementation] By Asad Iqbal
PageIndex reframes document retrieval by replacing embeddings with a reasoning-driven tree index that mirrors how human analysts actually navigate complex reports. Rather than approximate similarity matching, an LLM traverses a hierarchical JSON structure to locate exactly the right sections, preserving cross-references, table relationships, and the document hierarchy. Every retrieval decision remains fully traceable, making it audit-ready for regulated environments. Mafin 2.5, built on PageIndex, scored 98.7% on FinanceBench. The implementation walkthrough covers the full three-step pipeline using the DeepSeek-R1 paper, Groq, and the PageIndex SDK.
3. Ignore These 3 Math Ideas: And Backpropagation Will Never Make Sense By Tina Sharma
This article breaks down backpropagation into three foundational pieces: derivatives, the chain rule, and log-loss. It shows how derivatives measure a weight’s sensitivity to error, how the chain rule propagates that signal backward through every layer by multiplying local gradients, and how negative log-probability gives training a principled, information-theoretic target to minimize. Each concept carries real design trade-offs, from vanishing gradients to loss spikes, and the piece reframes all three as part of the vocabulary rather than barriers.
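The three ideas fit in a few lines for a single-weight logistic neuron: p = σ(w·x), loss = −log p for label y = 1. The chain rule multiplies the three local derivatives, and for log-loss the product collapses to the familiar (p − y)·x:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w, x, y = 0.5, 2.0, 1.0   # one weight, one input, positive label
z = w * x                  # pre-activation
p = sigmoid(z)             # predicted probability
loss = -math.log(p)        # log-loss (negative log-probability)

# Chain rule, term by term: dL/dw = dL/dp * dp/dz * dz/dw
grad_chain = (-1 / p) * (p * (1 - p)) * x

# The same gradient in its standard closed form.
grad_closed = (p - y) * x

assert abs(grad_chain - grad_closed) < 1e-12
```

Each factor in `grad_chain` is one of the article’s three ideas in action: the derivative measures sensitivity, the chain rule composes the local gradients, and log-loss supplies the quantity being differentiated.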
4. Is XGBoost gone: How Relational Foundation Models Conquered 500 Billion Row Enterprise Data By Ampatishan Sivalingam
XGBoost ruled enterprise ML for a decade, but its real bottleneck was the engineering machinery required to flatten relational databases into matrices it could read. Feature stores, Airflow DAGs, and aggregation pipelines consumed 70–80% of ML project time while destroying sequential signals in the process. This article introduces Relational Foundation Models, which eliminate that entire layer by ingesting raw database schemas directly and using schema-aware attention over foreign keys, as well as state-space models for temporal sequences.
If you are interested in publishing with Towards AI, check our guidelines and sign up. We will publish your work to our network if it meets our editorial policies and standards.
LAI #125: Karpathy’s Agent Ran 700 Experiments Without Him was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.