Pando: Do Interpretability Methods Work When Models Won’t Explain Themselves?
arXiv:2604.11061v1 Announce Type: new
Abstract: Mechanistic interpretability is often motivated by alignment auditing, where a model’s verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box…