Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness
arXiv:2506.24056v2
Abstract: RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference …