Hello r/MachineLearning! I work in the US transit industry and I went all-in on learning AI & ML a few months ago. When I heard about Andrej Karpathy's autoresearch framework, I thought it was really cool. I decided to use the same transit dataset from an earlier GPT-2 XL fine-tuning project to train a small 80M model from scratch. Autoresearch is designed for from-scratch pretraining (not fine-tuning), so I started a new project rather than retrofitting the GPT-2 XL one. I would love to hear from you …
Why did I do this?

My understanding is that Karpathy's autoresearch framework is an LLM-driven research loop: an agent edits a single training script, runs a 5-minute training experiment on a fixed dataset, and commits or reverts based on a single scalar metric. It was designed and tested on FineWeb (effectively an infinite supply of web-scale text). My corpus, however, is industry-specific and wayyy smaller. In reviewing Karpathy's wiki, I wanted to find out whether its core mechanics (the autonomous experiment loop, the 5-minute training limit, and the single-scalar pass/fail ratchet) still produce meaningful perplexity reductions with limited data. So I forked autoresearch, pointed it at a small transit-data corpus (~33 million tokens spanning traffic analysis, train plans, and regulatory Q&A pairs), and set out to answer two main questions:

Question #1: Does autoresearch work on a corpus six orders of magnitude smaller than its design target?

Question #2: What does the autoresearch agent find that I wouldn't have proposed?

To be clear, the output was intended as a methodology validation, not a deployable chatbot. I wanted to know whether the framework's pattern (autonomous overnight experiments, single-scalar ratchet, git-as-tracker) holds up when the data is small and specialized.
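Here's a minimal sketch of that loop as I understand it. Every name in it (apply_llm_edit, NUM_EXPERIMENTS, the --time-budget flag, the metric-on-stdout convention) is my own illustrative stand-in, not autoresearch's actual API:

```python
# Minimal sketch of the autoresearch-style ratchet loop, as I understand it
# from the wiki. All names here are illustrative stand-ins.
import subprocess

BUDGET_SECONDS = 300                 # the fixed 5-minute training budget
NUM_EXPERIMENTS = 25                 # illustrative count
best = float("inf")                  # lower is better (e.g., bits per byte)

def apply_llm_edit(path: str) -> None:
    """Stand-in for the LLM call that proposes ONE change to the script."""
    pass                             # placeholder: the agent rewrites train.py

def run_experiment() -> float:
    """Run train.py under the budget; assume it prints the dev metric last."""
    out = subprocess.run(
        ["python", "train.py", f"--time-budget={BUDGET_SECONDS}"],
        capture_output=True, text=True, timeout=BUDGET_SECONDS + 120,
    )
    return float(out.stdout.strip().splitlines()[-1])

for step in range(NUM_EXPERIMENTS):
    apply_llm_edit("train.py")                       # one change per experiment
    metric = run_experiment()
    if metric < best:                                # the ratchet only tightens
        best = metric
        subprocess.run(["git", "commit", "-am", f"exp {step}: {metric:.4f}"])
    else:
        subprocess.run(["git", "checkout", "--", "train.py"])  # revert the edit
```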
My Project constraints

My Design choices and why

Early on, I ran into a few challenges. The autoresearch framework makes three assumptions, none of which held in my setup: that FlashAttention-3 kernels are available on the GPU, that the agent's "one change per experiment" rule can be honored with the existing architecture controls, and that the held-out data is big enough to resist adaptive overfitting. Each is addressed below.
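On the first assumption: my GPU couldn't run FlashAttention-3 kernels, so the fix is essentially a capability check with a fallback. A minimal sketch of what I mean, assuming PyTorch 2.x's built-in SDPA as the fallback path (illustrative, not my exact code):

```python
# Minimal sketch: use flash-attn kernels when the package is importable,
# otherwise fall back to PyTorch's scaled_dot_product_attention.
import importlib.util
import torch.nn.functional as F

HAS_FLASH = importlib.util.find_spec("flash_attn") is not None

def attention(q, k, v, causal=True):
    # q, k, v: (batch, heads, seq, head_dim); flash-attn also wants fp16/bf16
    if HAS_FLASH:
        from flash_attn import flash_attn_func
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    # Fallback: PyTorch picks the best available kernel on this hardware
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```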
A few more pivots:

- I split the transit corpus into four parts (train, dev, val_public, test_private), grouping by topic so no document spans a boundary between any two parts. This prevents leakage between training, the agent's working data, the commit-gate data, and the data held back for milestone checks.
- The tokenizer is custom-built so that 65 high-frequency transit acronyms (FTA, MBTA, NTD, IIJA, etc.) each encode as a single token instead of being split into subword fragments (sketch after the findings below).
- Before the agent loop ran, I trained the same baseline five times with different random seeds to measure how much the score swings from luck alone. That gave me a noise floor for separating real improvements from random variation later on.

Key findings

The biggest single change seemed counterintuitive to me at first. The agent halved the batch size twice, from 524K tokens per training step down to 131K, fitting 3.6× more training updates into the same 5-minute budget (118 → 427 updates). Each update carried a noisier gradient signal, but the Muon optimizer handled the noise without breaking. I would have rejected this in code review on the conventional "bigger batches train more reliably" advice; the agent didn't share that bias and found it on experiment 13, after eight failed architectural attempts.

The model-size sweep settled the size question. 80M parameters was the clean peak: 30M and 50M lacked capacity, while 100M and 150M couldn't complete enough optimizer steps in 5 minutes to compete (150M ran only 84 steps before time expired).

The methodology layer caught two false positives. Two experiments improved the agent's working metric (dev_bpb) but did not transfer to the held-out split (val_public_bpb). Without the hidden gate, both would have been committed as wins; instead, both were reverted.

Then my rigor pass humbled me quite a bit. When I replicated the late-stage "winners" at a different random seed (INIT_SEED=43), the language-modeling result held rock-solid (Δ within ±0.005 across four runs: two architectures × two seeds), but two apparent accuracy improvements collapsed: terminology accuracy swung 9 percentage points between seeds, and regulatory-citation accuracy swung 15 percentage points. A proper statistical test on the accuracy benchmarks (terminology, Q&A, regulatory citation) showed that only 1 of 8 head-to-head comparisons was statistically significant. The conclusion was unavoidable: the language-modeling improvement is real (validated separately, ~20× above the noise floor and replicated at a fresh seed), but the apparent domain-accuracy "wins" turned out to be noise at our 100-250-item benchmark sizes.
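For concreteness, here's the shape of that significance check on a 250-item benchmark: a paired bootstrap on per-item correctness for two model variants. The arrays below are synthetic placeholders (in my pipeline they come from the eval harness), and the exact test I ran may differ:

```python
# Minimal sketch: paired bootstrap on per-item accuracy for two variants
# evaluated on the SAME benchmark items. Placeholder data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 250
a = rng.random(n) < 0.62   # placeholder: per-item correctness, model A
b = rng.random(n) < 0.58   # placeholder: per-item correctness, model B

diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)       # resample items with replacement
    diffs.append(a[idx].mean() - b[idx].mean())
diffs = np.array(diffs)

lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% CI on the accuracy gap
verdict = "significant" if (lo > 0 or hi < 0) else "not significant"
print(f"accuracy delta CI: [{lo:.3f}, {hi:.3f}] -> {verdict}")
```

At n=100-250 the CI is wide, which is exactly why most of my apparent accuracy wins dissolved.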
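And stepping back to the pivots for a moment: the acronym-preserving tokenizer is easier to show than describe. Here's a minimal sketch of the idea, using a stock GPT-2 tokenizer from HuggingFace transformers as a stand-in for my custom build (the acronym list is truncated to four of the 65):

```python
# Minimal sketch: make domain acronyms encode as single tokens.
# Stock GPT-2 tokenizer used as an illustrative stand-in for my custom one.
from transformers import AutoTokenizer

TRANSIT_ACRONYMS = ["FTA", "MBTA", "NTD", "IIJA"]  # 65 in the real build

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("MBTA reports to the NTD"))  # acronyms split into fragments

added = tok.add_tokens(TRANSIT_ACRONYMS)        # each acronym -> one token id
print(f"added {added} tokens")
print(tok.tokenize("MBTA reports to the NTD"))  # now kept whole

# If pairing with a model, grow the embedding table to match:
# model.resize_token_embeddings(len(tok))
```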
Key learnings

Five lessons from this project I plan to carry into any autoresearch-on-small-data follow-up:
Next steps

Honestly, I'm not sure where to go from here. There are a few directions that all feel worth pursuing, and I'd love input from the ML community on which is most interesting. The three I'm weighing:

1. Replicate the project at fresh random seeds. Re-run the full Phase 5 + Phase 7 pipeline at two or three new seeds to see whether the same wins (or close results) emerge … and whether the same false positives recur. I want to know: is the methodology repeatable, or did I get lucky in a different way?

2. Run autoresearch "by the book" on a general-purpose corpus. Clone Karpathy's main repo without my AutoTransit changes and test it on a chunk of FineWeb, which is what the framework was designed for. Comparing those results to the ones from my small, specialized dataset will show which findings are general to autoresearch and which are specific to small data.

3. Compare my from-scratch run to domain-adaptive pretraining (DAPT). I would take a similarly sized pretrained model off the shelf (Pythia-160M, already trained on web text) and continue training it on my transit dataset, keeping the same data, eval method, and approach (a sketch of what I mean follows this list). The main question is whether starting from random weights can compete with the obvious shortcut; most research I've seen says it shouldn't. If my from-scratch result holds up, that's the interesting part; if not, I'd still learn something useful.
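For option 3, here's a minimal sketch of the DAPT baseline I have in mind, using HuggingFace transformers. The data path and hyperparameters are placeholders, not settled choices:

```python
# Minimal sketch of DAPT: continue pretraining Pythia-160M on the transit
# corpus as a causal LM. Path and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token            # Pythia's tokenizer ships without one
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder path: plain-text corpus, same data as the from-scratch run
ds = load_dataset("text", data_files={"train": "transit_train.txt"})

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

ds = ds.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dapt-pythia-160m",   # placeholder
        per_device_train_batch_size=8,   # placeholder
        num_train_epochs=1,              # placeholder
    ),
    train_dataset=ds["train"],
    # mlm=False gives the standard next-token (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Same eval harness afterward, so the from-scratch and DAPT numbers are directly comparable.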
THANK YOU if you’ve read or scrolled this far!! Lol. Please share your thoughts …. Where’d I mess up? What’s interesting? What should I consider doing next?