cs.CL, cs.LG

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

arXiv:2602.07892v2 Announce Type: replace-cross
Abstract: Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignme…