Calibrate, Don’t Curate: Label-Efficient Estimation from Noisy LLM Judges
arXiv:2605.09702v1 Announce Type: cross
Abstract: Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heurist…