The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice
arXiv:2605.01311v1 Announce Type: new
Abstract: Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged,…
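
As a minimal illustration of the confounding described above, the following sketch uses a small set of made-up aggregated logs in which a user-side factor (here a hypothetical `user_type`) affects both which model is chosen and how outputs are rated. The data, the field names, and the functions `naive_rates` and `adjusted_rates` are assumptions introduced for illustration; they are not taken from the source, and the stratified reweighting shown is a standard textbook adjustment that only works when the confounding factor is actually recorded in the logs.

```python
from collections import defaultdict

# Hypothetical aggregated logs: (user_type, model, n_logs, n_positive_ratings).
# Experts mostly pick model A and rate harshly; novices mostly pick model B.
LOGS = [
    ("expert", "A", 80, 40),
    ("expert", "B", 20, 8),
    ("novice", "A", 20, 18),
    ("novice", "B", 80, 64),
]


def naive_rates(logs):
    """Pooled approval rate per model, ignoring the user-side factor."""
    n = defaultdict(int)
    pos = defaultdict(int)
    for _, model, count, positives in logs:
        n[model] += count
        pos[model] += positives
    return {m: pos[m] / n[m] for m in n}


def adjusted_rates(logs):
    """Approval rate per model within each user type, reweighted by
    each user type's share of all logs (assumes user_type is observed)."""
    stratum_n = defaultdict(int)
    for user_type, _, count, _ in logs:
        stratum_n[user_type] += count
    total = sum(stratum_n.values())

    rates = defaultdict(float)
    for user_type, model, count, positives in logs:
        weight = stratum_n[user_type] / total
        rates[model] += weight * (positives / count)
    return dict(rates)


if __name__ == "__main__":
    print("naive:   ", naive_rates(LOGS))
    print("adjusted:", adjusted_rates(LOGS))
```

On this toy data the pooled comparison favors model B (about 0.58 for A against 0.72 for B), while model A is rated higher within each user type and after reweighting (about 0.70 for A against 0.60 for B). The reversal comes entirely from who chose which model, which is the bias the abstract attributes to confounded model choice.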