Arnau Marin-Llobet, Javier Ferrando

Automated Interpretability and Feature Discovery in Language Models with Agents

Arnau Marin-Llobet, Javier Ferrando / May 5, 2026

arXiv:2605.01555v1 Announce Type: new
Abstract: We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: …
