Attention Is Where You Attack
arXiv:2605.00236v1 Announce Type: cross
Abstract: Safety-aligned large language models rely on reinforcement learning from human feedback (RLHF) and instruction tuning to refuse harmful requests, yet the internal mechanisms that implement this safety behavior remain poorly understood. We introduce the …