cs.AI, cs.CL

Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models

arXiv:2605.05090v1 Announce Type: new
Abstract: We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method…