A Unified Perturbation Framework for Analyzing Leaderboard Stability and Manipulation
arXiv:2605.15761v1 Announce Type: new
Abstract: Evaluation leaderboards such as LMArena play a central role in benchmarking large language models by aggregating pairwise human preferences into model rankings, yet the robustness of these rankings remai…