Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
arXiv:2604.28075v2 Announce Type: replace-cross
Abstract: Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like …