Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
arXiv:2605.03160v1 Announce Type: new
Abstract: The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-…
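
For readers unfamiliar with the standard protocol that the abstract refers to, the following is a minimal sketch of its two steps: ranking contexts by a feature's activation, and steering along that single feature's decoder direction. It illustrates the baseline only, not the pairwise matrix protocol, whose description is truncated here. Everything in it is an assumption made for illustration: the function names (`encode`, `top_activating_contexts`, `steer`), the ReLU encoder form, the identity weights, and the toy contexts are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def encode(x: Sequence[float],
           w_enc: Sequence[Sequence[float]],
           b_enc: Sequence[float]) -> Vector:
    """Toy ReLU encoder (assumed form): one activation per SAE feature.

    w_enc[j] is the encoder row for feature j.
    """
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def top_activating_contexts(acts: Sequence[Sequence[float]],
                            feature: int, k: int) -> List[int]:
    """Indices of the k contexts on which `feature` activates most strongly."""
    order = sorted(range(len(acts)), key=lambda i: acts[i][feature], reverse=True)
    return order[:k]


def steer(x: Sequence[float],
          w_dec: Sequence[Sequence[float]],
          feature: int, alpha: float) -> Vector:
    """Single-feature steering: add alpha times the feature's decoder direction."""
    return [xi + alpha * d for xi, d in zip(x, w_dec[feature])]


# Hypothetical toy data: 4 contexts, 2-dimensional activations, 2 features.
contexts = ["the cat", "a dog", "cat food", "blue sky"]
activations = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.0, 0.0]]
w_enc = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen only for readability
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]

# Step 1: label feature 0 from its top-activating contexts.
feature_acts = [encode(x, w_enc, b_enc) for x in activations]
top = top_activating_contexts(feature_acts, feature=0, k=2)
print([contexts[i] for i in top])

# Step 2: validate by steering a neutral activation along feature 0 alone.
steered = steer(activations[3], w_dec, feature=0, alpha=2.0)
print(steered)
```

In this sketch each feature is inspected and steered in isolation, which is the single-feature setup that the title argues can mislabel causal axes.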