Pando: A Controlled Benchmark for Interpretability Methods
TL;DR: Pando is a new interpretability benchmark of 720+ fine-tuned LLMs, each carrying a known decision rule, with rationales of varying faithfulness. We find that gradient-based methods outperform black-box baselines, while non-gradient methods struggle. This post discusse…