Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings
arXiv:2604.23130v1 Announce Type: cross
Abstract: Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks demonstrate this vulnerability but do not reveal the internal mechanisms that cause it. Th…
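To make the idea of layer-wise steering concrete, below is a minimal, illustrative sketch of activation steering on a HuggingFace causal LM, not the paper's actual method: the model name (gpt2), the layer index, the steering strength, and the contrastive-prompt construction of the steering vector are all assumptions chosen for the example.

```python
# Illustrative layer-wise activation steering via a forward hook.
# All constants below are hypothetical, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model
LAYER = 6       # hypothetical layer at which to intervene
ALPHA = 4.0     # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def residual_at_layer(prompt: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at `layer` for `prompt`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0].mean(dim=0)

# Steering direction from a contrastive prompt pair (a common recipe;
# whether the paper builds its vectors this way is an assumption).
direction = residual_at_layer("Sure, here is how to", LAYER) \
          - residual_at_layer("I cannot help with", LAYER)
direction = direction / direction.norm()

def steer_hook(module, inputs, output):
    # Add the scaled direction to every token position's residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
try:
    ids = tok("The safest way to respond is", return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

Sweeping LAYER and ALPHA while scoring the generations is the usual way such a probe would expose which layers' features are most sensitive to this kind of intervention; removing the hook in a finally block keeps the base model uncontaminated between runs.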