cs.CL, cs.LG

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

arXiv:2602.17546v2 Announce Type: replace
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses of…
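The full abstract is truncated above, but the core idea named in the title can be illustrated. The following is a minimal sketch, not the paper's method: it fine-tunes parameters on a task objective while anchoring them to the pre-fine-tuning ("safety-aligned") weights with an L2 penalty whose strength adapts online. The quadratic task loss, the drift budget `max_drift`, and the multiplicative adaptation rule for `lam` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of adaptive regularization during fine-tuning.
# The update minimizes a task loss plus an L2 anchor toward the
# reference (pre-fine-tuning) parameters; the anchor strength lam
# is increased whenever the parameters drift past a budget. The
# drift-based rule is an illustrative stand-in for the paper's
# (unseen) adaptive mechanism.

rng = np.random.default_rng(0)
theta_ref = np.zeros(8)             # safety-aligned reference weights
theta = theta_ref.copy()
target = rng.normal(0.0, 2.0, 8)    # the fine-tuning task pulls weights here

lam, lr = 0.1, 0.1                  # anchor strength, learning rate
max_drift = 1.0                     # allowed distance from the reference

for step in range(200):
    # gradient of 0.5*||theta - target||^2 + 0.5*lam*||theta - theta_ref||^2
    grad = (theta - target) + lam * (theta - theta_ref)
    theta -= lr * grad
    drift = np.linalg.norm(theta - theta_ref)
    # adaptive rule: strengthen the anchor when drift exceeds the budget
    if drift > max_drift:
        lam *= 1.5

drift = float(np.linalg.norm(theta - theta_ref))
print(drift, lam)
```

With a fixed `lam`, the minimizer sits at `target / (1 + lam)`, so a small constant penalty cannot guarantee staying inside the drift budget for every task; growing `lam` only when the budget is violated keeps the weights near the reference while otherwise letting the task loss dominate.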