cs.CL, cs.CR, cs.LG

Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

arXiv:2506.24056v2
Abstract: RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference …