Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
arXiv:2605.02105v1 Announce Type: new
Abstract: Pretraining optimizers are tuned to produce the strongest possible base model, on the assumption that a stronger starting point yields a stronger model after subsequent changes like post-training and qua…