cs.AI, cs.CL, cs.HC

Automated Interpretability and Feature Discovery in Language Models with Agents

arXiv:2605.01555v1 Announce Type: new
Abstract: We introduce an autonomous multiagent framework for mechanistic interpretability that automates both the explanation and the discovery of internal features in large language models. The system runs two coupled loops: …