Interpretable and Steerable Concept Bottleneck Sparse Autoencoders
arXiv:2512.10805v2 Announce Type: replace-cross
Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learn…