Tung-Ling Li, Hongliang Liu

Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Tung-Ling Li, Hongliang Liu / May 5, 2026

arXiv:2506.24056v2
Abstract: RLHF-style alignment trains language models to refuse unsafe requests, but how much operational margin does this refusal rest on? We introduce the refusal-affirmation logit gap: the difference …
