Improving Robustness In Sparse Autoencoders via Masked Regularization
arXiv:2604.06495v1 Announce Type: cross
Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and …
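The abstract's premise can be made concrete with a minimal sparse autoencoder: activations are encoded into an overcomplete latent space, decoded back, and trained with a reconstruction loss plus a sparsity penalty. The sketch below is a generic illustration of this setup; the dimensions, ReLU encoder, and the plain L1 penalty are assumptions for clarity, not the paper's masked-regularization method, which the truncated abstract does not specify.

```python
import numpy as np

# Illustrative single-layer sparse autoencoder (generic sketch, not the
# paper's method). Latent dim is larger than the input dim (overcomplete).
rng = np.random.default_rng(0)

d_model, d_latent = 8, 32                       # activation dim, latent dim
W_enc = rng.normal(0.0, 0.1, (d_latent, d_model))
b_enc = np.zeros(d_latent)
W_dec = rng.normal(0.0, 0.1, (d_model, d_latent))
b_dec = np.zeros(d_model)

def sae_forward(x, lam=1e-3):
    """Encode activation x, decode it, and return (x_hat, z, loss)."""
    z = np.maximum(W_enc @ x + b_enc, 0.0)      # ReLU -> nonnegative codes
    x_hat = W_dec @ z + b_dec                   # linear decoder reconstructs x
    # Loss = squared reconstruction error + L1 sparsity penalty on the codes
    loss = float(np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z)))
    return x_hat, z, loss

x = rng.normal(size=d_model)                    # a stand-in for an LLM activation
x_hat, z, loss = sae_forward(x)
```

The L1 term drives many latent coordinates to zero, which is the "sparsity as a proxy for interpretability" assumption the abstract says is imperfect.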