cs.AI, cs.CL, cs.LG

Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

arXiv:2602.01442v3
Abstract: Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evalua…