cs.CL, cs.LG

Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

arXiv:2602.17546v2 Announce Type: replace
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses of…
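The full abstract is truncated above, but the core idea named in the title can be illustrated. The following is a minimal sketch, not the paper's method: it fine-tunes parameters on a task objective while anchoring them to the pre-fine-tuning ("safety-aligned") weights with an L2 penalty whose strength adapts online. The quadratic task loss, the drift budget `max_drift`, and the multiplicative adaptation rule for `lam` are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of adaptive regularization during fine-tuning.
# The update minimizes a task loss plus an L2 anchor toward the
# reference (pre-fine-tuning) parameters; the anchor strength lam
# is increased whenever the parameters drift past a budget. The
# drift-based rule is an illustrative stand-in for the paper's
# (unseen) adaptive mechanism.

rng = np.random.default_rng(0)
theta_ref = np.zeros(8)             # safety-aligned reference weights
theta = theta_ref.copy()
target = rng.normal(0.0, 2.0, 8)    # the fine-tuning task pulls weights here

lam, lr = 0.1, 0.1                  # anchor strength, learning rate
max_drift = 1.0                     # allowed distance from the reference

for step in range(200):
    # gradient of 0.5*||theta - target||^2 + 0.5*lam*||theta - theta_ref||^2
    grad = (theta - target) + lam * (theta - theta_ref)
    theta -= lr * grad
    drift = np.linalg.norm(theta - theta_ref)
    # adaptive rule: strengthen the anchor when drift exceeds the budget
    if drift > max_drift:
        lam *= 1.5

drift = float(np.linalg.norm(theta - theta_ref))
print(drift, lam)
```

With a fixed `lam`, the minimizer sits at `target / (1 + lam)`, so a small constant penalty cannot guarantee staying inside the drift budget for every task; growing `lam` only when the budget is violated keeps the weights near the reference while otherwise letting the task loss dominate.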