Attention Is Where You Attack
arXiv:2605.00236v1 Announce Type: cross
Abstract: Safety-aligned large language models rely on reinforcement learning from human feedback (RLHF) and instruction tuning to refuse harmful requests, yet the internal mechanisms that implement this safety behavior remain poorly understood. We introduce the …