Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts

Addressing divergent representations from causal interventions on neural networks

Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts / April 24, 2026

arXiv:2511.04638v5 Announce Type: replace-cross
Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we as…
