VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
arXiv:2604.25235v1 Announce Type: cross
Abstract: Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction…
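For readers unfamiliar with conformal prediction, the following is a minimal sketch of generic split conformal prediction applied to judge scores. It is not the paper's procedure, which the truncated abstract does not specify. It assumes a held-out calibration set of judge scores paired with human reference scores, an absolute-residual nonconformity score, and a 1 to 5 rating scale; the function names `conformal_quantile`, `calibrate`, and `interval` are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    # Rank of the order statistic used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to support this alpha: interval is unbounded.
        return math.inf
    return sorted(residuals)[k - 1]


def calibrate(judge_scores, human_scores, alpha=0.1):
    """Return the interval half-width q from paired calibration scores."""
    # Assumed nonconformity score: absolute gap between judge and human score.
    residuals = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return conformal_quantile(residuals, alpha)


def interval(judge_score, q, lo=1.0, hi=5.0):
    """Prediction interval around a new judge score, clipped to the assumed scale."""
    return max(lo, judge_score - q), min(hi, judge_score + q)


if __name__ == "__main__":
    # Made-up calibration data for illustration only.
    judge = [3, 4, 2, 5, 1, 3, 4, 2, 5]
    human = [3, 3, 2, 4, 1, 5, 4, 2, 5]
    q = calibrate(judge, human, alpha=0.25)
    print(q, interval(4, q))
```

Under exchangeability of calibration and test examples, an interval built this way covers the human score with probability at least 1 - alpha. A wide interval signals that the point score is unreliable for that task, even when the judge's ranking of systems may still be informative.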