Quintin Pope, Ajay Hayagreeve Balaji, Jacques Thibodeau, Xiaoli Fern

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

May 7, 2026

arXiv:2605.05090v1 Announce Type: new
Abstract: We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method…
