Hello r/MachineLearning! I work in the US transit industry and I went all-in on learning AI & ML a few months ago. When I heard about Andrej Karpathy's autoresearch framework, I thought it was really cool. I decided to use the same transit dataset from an earlier GPT-2 XL fine-tuning project to train a small 80M model from scratch. Autoresearch is designed for from-scratch pretraining (not fine-tuning), so I started a new project rather than retrofitting the GPT-2 XL one. I would love to hear from you …
Why did I do this?

My understanding is that Karpathy's autoresearch framework is an LLM-driven research loop: an agent edits a single training script, runs a 5-minute training experiment on a fixed dataset, and commits or reverts based on a single scalar metric. It was designed and tested on FineWeb (effectively an infinite supply of web-scale text). My corpus, however, is industry-specific and wayyy smaller. In reviewing Karpathy's wiki, I wanted to find out whether its core mechanics (the autonomous experiment loop, the 5-minute training limit, and the single-scalar pass/fail ratchet) still produce meaningful perplexity reductions with limited data. So I forked autoresearch, pointed it at a small transit-data corpus (~33 million tokens spanning traffic analysis, train plans, and regulatory Q&A pairs), and set out to answer two main questions:

Question #1: Does autoresearch work on a corpus six orders of magnitude smaller than its design target?

Question #2: What does the autoresearch agent find that I wouldn't have proposed?

To be clear, the output was intended as a methodology validation, not a deployable chatbot. I wanted to know whether the framework's pattern (autonomous overnight experiments, single-scalar ratchet, git-as-tracker) holds up when the data is small and specialized.
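Here's a minimal sketch of that loop as I understand it. Every name in it (apply_llm_edit, NUM_EXPERIMENTS, the --time-budget flag, the metric-on-stdout convention) is my own illustrative stand-in, not autoresearch's actual API:

```python
# Minimal sketch of the autoresearch-style ratchet loop, as I understand it
# from the wiki. All names here are illustrative stand-ins.
import subprocess

BUDGET_SECONDS = 300                 # the fixed 5-minute training budget
NUM_EXPERIMENTS = 25                 # illustrative count
best = float("inf")                  # lower is better (e.g., bits per byte)

def apply_llm_edit(path: str) -> None:
    """Stand-in for the LLM call that proposes ONE change to the script."""
    pass                             # placeholder: the agent rewrites train.py

def run_experiment() -> float:
    """Run train.py under the budget; assume it prints the dev metric last."""
    out = subprocess.run(
        ["python", "train.py", f"--time-budget={BUDGET_SECONDS}"],
        capture_output=True, text=True, timeout=BUDGET_SECONDS + 120,
    )
    return float(out.stdout.strip().splitlines()[-1])

for step in range(NUM_EXPERIMENTS):
    apply_llm_edit("train.py")                       # one change per experiment
    metric = run_experiment()
    if metric < best:                                # the ratchet only tightens
        best = metric
        subprocess.run(["git", "commit", "-am", f"exp {step}: {metric:.4f}"])
    else:
        subprocess.run(["git", "checkout", "--", "train.py"])  # revert the edit
```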
My Project constraints

My Design choices and why

Early on, I ran into a few challenges. The autoresearch framework makes three assumptions, none of which held in my setup: that FlashAttention-3 kernels are available on the GPU, that the agent's "one change per experiment" rule can be honored with the existing architecture controls, and that the held-out data is big enough to resist adaptive overfitting. Each is addressed below.
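On the first assumption: my GPU couldn't run FlashAttention-3 kernels, so the fix is essentially a capability check with a fallback. A minimal sketch of what I mean, assuming PyTorch 2.x's built-in SDPA as the fallback path (illustrative, not my exact code):

```python
# Minimal sketch: use flash-attn kernels when the package is importable,
# otherwise fall back to PyTorch's scaled_dot_product_attention.
import importlib.util
import torch.nn.functional as F

HAS_FLASH = importlib.util.find_spec("flash_attn") is not None

def attention(q, k, v, causal=True):
    # q, k, v: (batch, heads, seq, head_dim); flash-attn also wants fp16/bf16
    if HAS_FLASH:
        from flash_attn import flash_attn_func
        # flash_attn_func expects (batch, seq, heads, head_dim)
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    # Fallback: PyTorch picks the best available kernel on this hardware
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```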
A few more pivots:

- I split the transit corpus into four parts (train, dev, val_public, test_private), grouping by topic so no document spans a boundary between any two parts. This prevents leakage between training, the agent's working data, the commit-gate data, and the data held back for milestone checks.
- The tokenizer is custom-built so that 65 high-frequency transit acronyms (FTA, MBTA, NTD, IIJA, etc.) each encode as a single token instead of being split into subword fragments (sketch after the findings below).
- Before the agent loop ran, I trained the same baseline five times with different random seeds to measure how much the score swings from luck alone. That gave me a noise floor for separating real improvements from random variation later on.

Key findings

The biggest single change seemed counterintuitive to me at first. The agent halved the batch size twice, from 524K tokens per training step down to 131K, fitting 3.6× more training updates into the same 5-minute budget (118 → 427 updates). Each update carried a noisier gradient signal, but the Muon optimizer handled the noise without breaking. I would have rejected this in code review on the conventional "bigger batches train more reliably" advice; the agent didn't share that bias and found it on experiment 13, after eight failed architectural attempts.

The model-size sweep settled the size question. 80M parameters was the clean peak: 30M and 50M lacked capacity, while 100M and 150M couldn't complete enough optimizer steps in 5 minutes to compete (150M ran only 84 steps before time expired).

The methodology layer caught two false positives. Two experiments improved the agent's working metric (dev_bpb) but did not transfer to the held-out split (val_public_bpb). Without the hidden gate, both would have been committed as wins; instead, both were reverted.

Then my rigor pass humbled me quite a bit. When I replicated the late-stage "winners" at a different random seed (INIT_SEED=43), the language-modeling result held rock-solid (Δ within ±0.005 across four runs: two architectures × two seeds), but two apparent accuracy improvements collapsed: terminology accuracy swung 9 percentage points between seeds, and regulatory-citation accuracy swung 15 percentage points. A proper statistical test on the accuracy benchmarks (terminology, Q&A, regulatory citation) showed that only 1 of 8 head-to-head comparisons was statistically significant. The conclusion was unavoidable: the language-modeling improvement is real (validated separately, ~20× above the noise floor and replicated at a fresh seed), but the apparent domain-accuracy "wins" turned out to be noise at our 100-250-item benchmark sizes.
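For concreteness, here's the shape of that significance check on a 250-item benchmark: a paired bootstrap on per-item correctness for two model variants. The arrays below are synthetic placeholders (in my pipeline they come from the eval harness), and the exact test I ran may differ:

```python
# Minimal sketch: paired bootstrap on per-item accuracy for two variants
# evaluated on the SAME benchmark items. Placeholder data, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 250
a = rng.random(n) < 0.62   # placeholder: per-item correctness, model A
b = rng.random(n) < 0.58   # placeholder: per-item correctness, model B

diffs = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)       # resample items with replacement
    diffs.append(a[idx].mean() - b[idx].mean())
diffs = np.array(diffs)

lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% CI on the accuracy gap
verdict = "significant" if (lo > 0 or hi < 0) else "not significant"
print(f"accuracy delta CI: [{lo:.3f}, {hi:.3f}] -> {verdict}")
```

At n=100-250 the CI is wide, which is exactly why most of my apparent accuracy wins dissolved.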
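And stepping back to the pivots for a moment: the acronym-preserving tokenizer is easier to show than describe. Here's a minimal sketch of the idea, using a stock GPT-2 tokenizer from HuggingFace transformers as a stand-in for my custom build (the acronym list is truncated to four of the 65):

```python
# Minimal sketch: make domain acronyms encode as single tokens.
# Stock GPT-2 tokenizer used as an illustrative stand-in for my custom one.
from transformers import AutoTokenizer

TRANSIT_ACRONYMS = ["FTA", "MBTA", "NTD", "IIJA"]  # 65 in the real build

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("MBTA reports to the NTD"))  # acronyms split into fragments

added = tok.add_tokens(TRANSIT_ACRONYMS)        # each acronym -> one token id
print(f"added {added} tokens")
print(tok.tokenize("MBTA reports to the NTD"))  # now kept whole

# If pairing with a model, grow the embedding table to match:
# model.resize_token_embeddings(len(tok))
```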
Key learnings

Five lessons from this project I plan to carry into any autoresearch-on-small-data follow-up:
Next steps

Honestly, I'm not sure where to go from here. There are a few directions that all feel worth pursuing, and I'd love input from the ML community on which is most interesting. The three I'm weighing:

1. Replicate the project at fresh random seeds. Re-run the full Phase 5 + Phase 7 pipeline at two or three new seeds to see whether the same wins (or close results) emerge … and whether the same false positives recur. I want to know: is the methodology repeatable, or did I get lucky in a different way?

2. Run autoresearch "by the book" on a general-purpose corpus. Clone Karpathy's main repo without my AutoTransit changes and test it on a chunk of FineWeb, which is what the framework was designed for. Comparing those results to the ones from my small, specialized dataset will show which findings are general to autoresearch and which are specific to small data.

3. Compare my from-scratch run to domain-adaptive pretraining (DAPT). I would take a similarly sized pretrained model off the shelf (Pythia-160M, already trained on web text) and continue training it on my transit dataset, keeping the same data, eval method, and approach (a sketch of what I mean follows this list). The main question is whether starting from random weights can compete with the obvious shortcut; most research I've seen says it shouldn't. If my from-scratch result holds up, that's the interesting part; if not, I'd still learn something useful.
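For option 3, here's a minimal sketch of the DAPT baseline I have in mind, using HuggingFace transformers. The data path and hyperparameters are placeholders, not settled choices:

```python
# Minimal sketch of DAPT: continue pretraining Pythia-160M on the transit
# corpus as a causal LM. Path and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token            # Pythia's tokenizer ships without one
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder path: plain-text corpus, same data as the from-scratch run
ds = load_dataset("text", data_files={"train": "transit_train.txt"})

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=1024)

ds = ds.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dapt-pythia-160m",   # placeholder
        per_device_train_batch_size=8,   # placeholder
        num_train_epochs=1,              # placeholder
    ),
    train_dataset=ds["train"],
    # mlm=False gives the standard next-token (causal LM) objective
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Same eval harness afterward, so the from-scratch and DAPT numbers are directly comparable.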
THANK YOU if you’ve read or scrolled this far!! Lol. Please share your thoughts …. Where’d I mess up? What’s interesting? What should I consider doing next?