Mechanistic Interpretability with Sparse Autoencoder Neural Operators
arXiv:2509.03738v4 Announce Type: replace-cross
Abstract: We introduce sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces rather than fixed-dimensional Euclidean representations. We formalize the functional representation hypothesis, where data are explained through sparse compositions of structured functions. Unlike standard SAEs that represent concepts with scalar activations, SAE-NOs parameterize concepts as functions, enabling representations that capture not only a concept's presence, but also how and where it is expressed across the input domain. We achieve this through joint sparsity: concept sparsity selects active concepts, while domain sparsity governs where they are expressed. We instantiate this framework using Fourier neural operators (SAE-FNOs), parameterizing concepts as integral operators in the Fourier domain. This functional and spectral parameterization is particularly advantageous when data exhibit spatial structure across scales or when concepts are frequency-structured. We characterize SAE-FNO on vision data and demonstrate that it learns localized patterns, uses concepts more efficiently, and exhibits stable concept characteristics across sparsity levels. We further show that SAE-FNO adapts to changes in domain size and generalizes across discretizations, operating at resolutions beyond those seen during training, where standard SAEs fail. We also introduce lifting into SAEs and show theoretically and empirically that it acts as a preconditioner that accelerates optimization. Overall, our results show that moving from vector-valued to functional parameterizations, with concept and domain sparsity, extends SAEs from representing concept presence to modeling structured concept expression, highlighting the importance of parameterization.
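Below is a minimal, hypothetical sketch of how an SAE-FNO-style layer could look, illustrating the ideas the abstract describes: concepts parameterized as activation maps over the input domain, top-k concept sparsity, a soft threshold for domain sparsity, an FNO-style spectral (integral) decoder, and a lifting layer. All class and parameter names (SpectralDecoder, SAEFNO, k_active, tau, etc.) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code). Assumes PyTorch; names and shapes
# are illustrative choices. Concepts are functions over the input domain: the
# encoder produces one activation map per concept, top-k keeps a few concepts
# per sample (concept sparsity), a soft threshold localizes where each concept
# is expressed (domain sparsity), and a Fourier-domain integral operator decodes.
import torch
import torch.nn as nn
import torch.fft


class SpectralDecoder(nn.Module):
    """Integral operator parameterized in the Fourier domain (FNO-style)."""

    def __init__(self, channels: int, modes: int):
        super().__init__()
        scale = 1.0 / channels
        # Learnable complex weights on the lowest `modes` frequencies per axis.
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat)
        )
        self.modes = modes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) -> same shape; resolution-agnostic.
        B, C, H, W = x.shape
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(B, C, H, W // 2 + 1, dtype=torch.cfloat, device=x.device)
        m = self.modes
        out_ft[:, :, :m, :m] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.weights
        )
        return torch.fft.irfft2(out_ft, s=(H, W))


class SAEFNO(nn.Module):
    def __init__(self, in_ch=1, lift_ch=16, n_concepts=64, k_active=8, modes=8, tau=0.1):
        super().__init__()
        self.lift = nn.Conv2d(in_ch, lift_ch, kernel_size=1)         # lifting layer
        self.encode = nn.Conv2d(lift_ch, n_concepts, kernel_size=1)  # per-pixel concept maps
        self.decode = SpectralDecoder(n_concepts, modes)             # functional decoder
        self.project = nn.Conv2d(n_concepts, in_ch, kernel_size=1)
        self.k_active, self.tau = k_active, tau

    def forward(self, x):
        z = torch.relu(self.encode(self.lift(x)))        # (B, n_concepts, H, W)
        # Concept sparsity: keep the top-k concepts per sample (by total energy).
        energy = z.flatten(2).sum(-1)                     # (B, n_concepts)
        topk = energy.topk(self.k_active, dim=1).indices
        mask = torch.zeros_like(energy).scatter_(1, topk, 1.0)
        z = z * mask[:, :, None, None]
        # Domain sparsity: soft-threshold so each concept is expressed only locally.
        z = torch.relu(z - self.tau)
        return self.project(self.decode(z)), z


if __name__ == "__main__":
    model = SAEFNO()
    recon, codes = model(torch.randn(2, 1, 32, 32))
    print(recon.shape, codes.shape)
    # The same weights apply at a finer discretization, mirroring the claimed
    # ability to operate at resolutions beyond those seen during training.
    recon_hi, _ = model(torch.randn(2, 1, 64, 64))
    print(recon_hi.shape)
```

Because the decoder acts in the Fourier domain with a fixed number of modes, the same parameters can be evaluated on inputs of different spatial resolution, which is one plausible way to realize the discretization-invariance behavior described in the abstract.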