Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
arXiv:2509.13332v2 Announce Type: replace
Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this w…