JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
arXiv:2604.23478v2 Announce Type: replace
Abstract: Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a …
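The stability notion the abstract describes can be sketched as an agreement score: run the same judge over semantically equivalent rephrasings of the judging prompt and measure how often the modal verdict is returned. The sketch below is illustrative only; `toy_judge`, `verdict_agreement`, and the prompt variants are hypothetical stand-ins, not the paper's benchmark, and a real study would call an LLM API inside the judge function.

```python
from collections import Counter

def verdict_agreement(judge, prompt_variants, sample):
    """Fraction of paraphrased judge prompts yielding the modal verdict.

    `judge`, `prompt_variants`, and `sample` are hypothetical names;
    in a real setup `judge` would wrap an LLM API call.
    """
    verdicts = [judge(p, sample) for p in prompt_variants]
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return modal_count / len(verdicts)

# Toy deterministic stand-in for an LLM judge: it flips its verdict
# when the prompt mentions "strict" grading, mimicking prompt sensitivity.
def toy_judge(prompt, sample):
    return "fail" if "strict" in prompt else "pass"

variants = [
    "Rate the answer as pass or fail.",
    "Decide whether the answer passes.",
    "Apply strict criteria: pass or fail?",
]

score = verdict_agreement(toy_judge, variants, "some model answer")
print(score)  # agreement of 2/3: one rephrasing flips the verdict
```

A perfectly stable judge would score 1.0 on every sample; scores below 1.0 quantify the sensitivity to rephrasing that the abstract argues is largely unexamined.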