Bias and Uncertainty in LLM-as-a-Judge Estimation
arXiv:2605.06939v1 Announce Type: new
Abstract: LLM-as-a-Judge evaluation has become a standard tool for assessing base model performance. However, the naive estimator of performance, i.e., the raw judge outputs, is systematically biased. …
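
To illustrate why raw judge outputs can be biased, here is a minimal sketch under an assumed setting that the abstract does not itself specify: a binary judge that marks each response correct or incorrect, with a fixed sensitivity (probability of marking a truly correct response as correct) and a fixed specificity (probability of marking a truly incorrect response as incorrect). The function names and parameters are hypothetical, and this is not presented as the paper's method.

```python
def expected_naive_estimate(true_rate: float, sensitivity: float, specificity: float) -> float:
    """Expected fraction of responses the judge marks correct.

    Assumption (not from the abstract): a binary judge with fixed
    sensitivity and specificity, applied independently to each response.
    """
    # Truly correct responses are accepted with probability `sensitivity`;
    # truly incorrect ones are wrongly accepted with probability 1 - specificity.
    return true_rate * sensitivity + (1.0 - true_rate) * (1.0 - specificity)


def naive_bias(true_rate: float, sensitivity: float, specificity: float) -> float:
    """Bias of the naive estimator: its expectation minus the true rate."""
    return expected_naive_estimate(true_rate, sensitivity, specificity) - true_rate


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    for p in (0.2, 0.5, 0.8):
        print(p, round(naive_bias(p, sensitivity=0.9, specificity=0.7), 4))
```

Under this assumed model the bias does not shrink as more responses are judged, since it depends only on the judge's error rates and the true accuracy; it vanishes when the judge is perfect (sensitivity and specificity both equal to 1) or when the two kinds of error happen to cancel at a particular true rate.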