Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers
arXiv:2602.01442v3 Announce Type: replace-cross
Abstract: Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evalua…
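To make the question in the abstract concrete, here is a minimal toy sketch of how a gradient-based score and a causal (ablation) score can rank components in opposite orders. It is not the paper's method or experimental setup. The three "components" `a`, `b`, `c`, the `model` function, and the helpers `grad_times_act` and `ablation_effect` are all hypothetical and chosen for illustration. Component `a` passes through a saturating nonlinearity, while `b` and `c` are redundant with each other through a `max`. Gradients are estimated with central finite differences so that the example needs only the standard library.

```python
import math


def model(acts):
    """Toy scalar output built from three hypothetical component activations."""
    a, b, c = acts["a"], acts["b"], acts["c"]
    # 'a' is saturated by tanh; 'b' and 'c' are redundant via max.
    return math.tanh(4.0 * a) + 0.5 * max(b, c)


def grad_times_act(acts, eps=1e-6):
    """Gradient-times-activation score per component (central finite differences)."""
    scores = {}
    for name, value in acts.items():
        up = dict(acts, **{name: value + eps})
        down = dict(acts, **{name: value - eps})
        grad = (model(up) - model(down)) / (2 * eps)
        scores[name] = grad * value
    return scores


def ablation_effect(acts):
    """Causal score per component: drop in output when it is zero-ablated."""
    base = model(acts)
    return {name: base - model(dict(acts, **{name: 0.0})) for name in acts}


acts = {"a": 1.0, "b": 1.0, "c": 0.95}
attribution = grad_times_act(acts)
causal = ablation_effect(acts)

for name in acts:
    print(f"{name}: grad*act={attribution[name]:.4f}  ablation={causal[name]:.4f}")
```

In this toy, the saturated component `a` receives a near-zero gradient score but has the largest ablation effect, while `b` receives a large gradient score although its redundant partner `c` covers almost all of its contribution when it is removed. Zero-ablation is only one possible intervention; mean-ablation or resampling would be equally valid choices for the causal score.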