Attributions All the Way Down? The Metagame of Interpretability
arXiv:2605.06295v1 Announce Type: cross
Abstract: We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure th…