Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data
arXiv:2605.10129v1 Announce Type: new
Abstract: Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation …
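The abstract is truncated, but the title points to a two-stage recipe: pre-train first on a clean synthetic corpus ("pre-pre-training"), then continue standard pre-training on the noisy web-scale corpus. The sketch below is a minimal illustration of that schedule under those assumptions; the model, datasets, and hyperparameters are illustrative placeholders, not the paper's actual setup.

```python
# Hypothetical two-stage pre-training sketch (not the paper's implementation):
# Phase 1 trains on synthetic data, Phase 2 continues on the noisy web corpus.
import torch
import torch.nn as nn

VOCAB, DIM, CTX = 1000, 64, 32

# Tiny stand-in language model: embedding -> transformer encoder -> vocab logits.
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2
    ),
    nn.Linear(DIM, VOCAB),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

def lm_step(batch):
    """One next-token-prediction step on a (B, CTX) batch of token ids."""
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def train_phase(batches, steps):
    """Run a fixed number of optimization steps over a batch stream."""
    for _, batch in zip(range(steps), batches):
        lm_step(batch)

def random_batches():
    # Placeholder token stream; a real run would draw from tokenized corpora.
    while True:
        yield torch.randint(0, VOCAB, (8, CTX))

# Phase 1: synthetic pre-pre-training (clean, structured data in the paper's framing).
train_phase(random_batches(), steps=100)

# Phase 2: standard pre-training on the noisy web-scale corpus, continuing from
# the synthetically warmed-up weights rather than a random initialization.
train_phase(random_batches(), steps=1000)
```

The intent of the ordering is that the synthetic phase gives the model clean structure to latch onto before it is exposed to noisy data, which is how the title's robustness claim would plausibly be realized; the specific data mixtures and step counts are assumptions here.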