Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
arXiv:2602.07892v2 Announce Type: replace-cross
Abstract: Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the \emph{alignment tax}…