Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
arXiv:2604.20549v1 Announce Type: new
Abstract: As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high q…