Utsav Maskey, Mark Dras, Usman Naseem

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs

Utsav Maskey, Mark Dras, Usman Naseem / March 31, 2026

arXiv:2603.27518v1 Announce Type: new
Abstract: Aligned language models that are trained to refuse harmful requests also exhibit over-refusal: they decline safe instructions that seemingly resemble harmful instructions. A natural approach is to ablate…

Author name: Utsav Maskey, Mark Dras, Usman Naseem

Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs