Guardrails in Logit Space: Safety Token Regularization for LLM Alignment
arXiv:2604.17210v1 Announce Type: new
Abstract: Fine-tuning well-aligned large language models (LLMs) on new domains often degrades their safety alignment, even when using benign datasets. Existing safety alignment techniques primarily focus on pretra…
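The truncated abstract does not spell out the objective, but the title suggests a penalty in logit space tied to safety-relevant tokens. A minimal sketch of one plausible form: the standard fine-tuning cross-entropy plus an L2 term that keeps the fine-tuned model's logits close to a frozen reference model's logits on a designated set of safety token ids. The function name `regularized_loss`, the L2 form, and the `safety_ids` set are all illustrative assumptions, not the paper's stated method.

```python
# Hypothetical sketch of safety-token logit regularization.
# The specific loss form and the notion of a fixed "safety token" id set
# are assumptions; the paper's actual objective is not shown above.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def regularized_loss(logits, ref_logits, targets, safety_ids, lam=0.1):
    """Task cross-entropy plus an L2 penalty anchoring the fine-tuned
    model's logits to a frozen reference model's logits on a set of
    safety-relevant token ids (hypothetical formulation).

    logits, ref_logits: (T, V) arrays of per-position vocabulary logits.
    targets: (T,) gold token ids for the fine-tuning task.
    safety_ids: token ids whose logits the penalty protects.
    lam: regularization strength.
    """
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(len(targets)), targets]).mean()
    diff = logits[:, safety_ids] - ref_logits[:, safety_ids]
    return ce + lam * (diff ** 2).mean()

# Toy usage: identical logits incur zero penalty; drifting away from the
# reference on the safety ids raises the loss.
rng = np.random.default_rng(0)
T, V = 4, 10
logits = rng.normal(size=(T, V))
ref = logits.copy()
targets = rng.integers(0, V, size=T)
safety = np.array([2, 5, 7])
base = regularized_loss(logits, ref, targets, safety)
```

Under this sketch, the fine-tuning gradient is free to move logits for task tokens while drift on the protected ids is pulled back toward the reference model, which matches the stated goal of preserving alignment during benign-domain fine-tuning.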