Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
arXiv:2604.12384v1 Announce Type: new
Abstract: Safety alignment in Large Language Models (LLMs) remains fragile during fine-tuning: even benign adaptation can degrade refusal behaviors learned in pre-training and enable harmful responses. Existing d…