cs.AI, cs.LG

Improving Robustness In Sparse Autoencoders via Masked Regularization

arXiv:2604.06495v1 Announce Type: cross
Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and …
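The SAE setup the abstract refers to can be sketched as follows. This is a minimal toy illustration, not the paper's method: the ReLU encoder/decoder and L1 penalty are the standard SAE recipe, while the binary `mask` restricting which latents the penalty touches is purely a hypothetical stand-in for whatever masked-regularization scheme the paper proposes (the truncated abstract does not specify it).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent = 16, 64  # toy sizes; real SAEs use a much wider latent

# Randomly initialized toy encoder/decoder weights
W_enc = rng.normal(0, 0.1, (d_latent, d_model))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0, 0.1, (d_model, d_latent))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Standard SAE: a ReLU encoder yields a (hopefully sparse) latent z."""
    z = np.maximum(0.0, W_enc @ x + b_enc)
    x_hat = W_dec @ z + b_dec
    return z, x_hat

def sae_loss(x, lam=1e-3, mask=None):
    """Reconstruction MSE plus an L1 sparsity penalty.

    `mask` is a hypothetical binary vector selecting which latents the
    penalty applies to; mask=None recovers the ordinary unmasked SAE loss.
    """
    z, x_hat = sae_forward(x)
    if mask is None:
        mask = np.ones_like(z)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = lam * np.sum(mask * np.abs(z))
    return recon + sparsity

x = rng.normal(size=d_model)          # stand-in for an LLM activation
mask = (rng.random(d_latent) > 0.5).astype(float)  # illustrative mask
print(sae_loss(x))             # unmasked L1 penalty over all latents
print(sae_loss(x, mask=mask))  # penalty restricted to a latent subset
```

Because the mask only zeroes out some penalty terms while the reconstruction term is unchanged, the masked loss is never larger than the unmasked one on the same input.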