OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
arXiv:2603.28858v2 Announce Type: replace-cross
Abstract: Continual pre-training is widely used to adapt LLMs to target languages and domains, yet the mixture ratio of training data remains a sensitive hyperparameter that is expensive to tune: they mu…