Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
arXiv:2605.12874v1 Announce Type: new
Abstract: Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short na…