Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

May 7, 2026

arXiv:2605.04454v1 Announce Type: cross
Abstract: Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise…
