Michael A. Riegler, Birk Sebastian Frostelid Torpmann-Hagen

Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

Michael A. Riegler, Birk Sebastian Frostelid Torpmann-Hagen / May 6, 2026

arXiv:2605.03160v1 Announce Type: new
Abstract: The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-…
