cs.LG, econ.EM, stat.AP, stat.ML

The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

arXiv:2605.01311v1 Announce Type: new
Abstract: Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged,…
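
The bias described in the abstract can be illustrated with a small worked example. The sketch below is not the paper's method. It assumes a single, fully observed user-side factor (a hypothetical task_difficulty label) and uses made-up aggregated log counts, chosen so that users with hard tasks mostly pick model A and hard tasks receive lower ratings regardless of model. It contrasts a naive per-model average with a standard stratified (backdoor-style) adjustment.

```python
from collections import defaultdict

# Hypothetical aggregated logs, invented for illustration only:
# (task_difficulty, model, n_interactions, mean_rating)
LOGS = [
    ("easy", "A", 20, 0.9),
    ("easy", "B", 80, 0.8),
    ("hard", "A", 80, 0.5),
    ("hard", "B", 20, 0.4),
]


def naive_means(logs):
    """Average rating per model, ignoring the confounder."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _, model, n, mean in logs:
        total[model] += n * mean
        count[model] += n
    return {model: total[model] / count[model] for model in total}


def stratified_means(logs):
    """Average rating per model after reweighting each stratum
    by its share of all logged interactions (assumes the confounder
    is observed and every model appears in every stratum)."""
    stratum_size = defaultdict(int)
    for stratum, _, n, _ in logs:
        stratum_size[stratum] += n
    grand_total = sum(stratum_size.values())

    adjusted = defaultdict(float)
    for stratum, model, _, mean in logs:
        adjusted[model] += (stratum_size[stratum] / grand_total) * mean
    return dict(adjusted)


if __name__ == "__main__":
    print("naive:     ", naive_means(LOGS))
    print("stratified:", stratified_means(LOGS))
```

With these invented counts, the naive averages are about 0.58 for A and 0.72 for B, so B looks better, while the stratified averages are about 0.70 for A and 0.60 for B, matching the fact that A is rated 0.1 higher within each difficulty level. The adjustment only works here because the confounder is assumed to be fully recorded in the logs.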