Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer
arXiv:2604.24302v1
Abstract: Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larg…