Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal

Towards Understanding the Robustness of Sparse Autoencoders

Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal / April 22, 2026

arXiv:2604.18756v1 Announce Type: new
Abstract: Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, the…
