Constructing Interpretable Features from Compositional Neuron Groups
arXiv:2506.10920v2 Announce Type: replace-cross
Abstract: A central goal of mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on…