Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
arXiv:2512.24503v2 Announce Type: replace
Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited und…