cs.AI, cs.CR

Attention Is Where You Attack

arXiv:2605.00236v1 Announce Type: cross
Abstract: Safety-aligned large language models rely on reinforcement learning from human feedback (RLHF) and instruction tuning to refuse harmful requests, yet the internal mechanisms that implement this safety behavior remain poorly understood. We introduce the …