Data Mixing for Large Language Models Pretraining: A Survey and Outlook
arXiv:2604.16380v1 Announce Type: new
Abstract: Large language models (LLMs) rely on pretraining over massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under rea…