I see a lot of people trying to justify the use of LLM judges, including NeurIPS.
Well, tbh PAT was impressive though. However, all the available LLMs (I tested the highest versions of Claude, Gemini, and GPT) fall apart when the paper is a hard one, e.g. when the theoretical aspects are demanding. They seem to work okay-ish if the paper is good.
The point is, I'm kind of afraid of reviewers relying on LLM judges and writing gibberish, for the upcoming NeurIPS and potentially other venues.