cs.CL, cs.LG

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv:2604.16380v1 Announce Type: new
Abstract: Large language models (LLMs) are pretrained on massive, heterogeneous corpora, where the composition of the training data has a decisive impact on training efficiency and downstream generalization under rea…