SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
arXiv:2604.17691v1 Announce Type: new
Abstract: Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility beco…