Automating Interpretability with Agents
This work was produced as part of the SPAR Program – Fall 2025 Cohort, with support from Georg Lange.

Kseniya Parkhamchuk, Jack Payne

TL;DR

Automated feature explanations from Delphi fail 38% of the time. The failures are sensitivity issues, output featur…