cs.LG, stat.ME, stat.ML

Bias and Uncertainty in LLM-as-a-Judge Estimation

arXiv:2605.06939v1 Announce Type: new
Abstract: LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. …
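
As an illustration of why raw judge outputs can be biased, the following is a minimal sketch under a simple binary noisy-judge model. It is an assumption made for illustration, not the estimator or setup from the paper. The names `naive_expectation` and `simulate_naive` and the parameters `sensitivity` and `specificity` are hypothetical. In this model the judge passes a truly correct answer with probability `sensitivity` and fails a truly incorrect answer with probability `specificity`, so the expected raw pass rate is p * sensitivity + (1 - p) * (1 - specificity), which generally differs from the true accuracy p.

```python
import random


def naive_expectation(p, sensitivity, specificity):
    """Expected raw judge pass rate under the assumed binary noisy-judge model.

    p           -- true fraction of correct model answers
    sensitivity -- P(judge says pass | answer is correct)   (assumed known)
    specificity -- P(judge says fail | answer is incorrect) (assumed known)
    """
    return p * sensitivity + (1 - p) * (1 - specificity)


def simulate_naive(p, sensitivity, specificity, n, seed=0):
    """Monte Carlo estimate of the naive estimator: the mean of raw judge verdicts."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(n):
        is_correct = rng.random() < p
        if is_correct:
            # True positive with probability `sensitivity`.
            verdict = rng.random() < sensitivity
        else:
            # False positive with probability 1 - `specificity`.
            verdict = rng.random() >= specificity
        passes += verdict
    return passes / n


if __name__ == "__main__":
    p, sens, spec = 0.5, 0.95, 0.7  # illustrative values, not from the paper
    expected = naive_expectation(p, sens, spec)
    simulated = simulate_naive(p, sens, spec, n=100_000)
    print(f"true accuracy:        {p:.3f}")
    print(f"expected naive value: {expected:.3f}")
    print(f"simulated naive mean: {simulated:.3f}")
    print(f"bias (expected - p):  {expected - p:+.3f}")
```

With these illustrative values the naive estimate concentrates around a value above the true accuracy, and collecting more samples does not remove the gap: the error is systematic rather than a matter of sampling noise.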