Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
arXiv:2603.24826v1
Abstract: Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for…