Seeing Isn’t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
arXiv:2604.21523v1 Announce Type: new
Abstract: Large Vision-Language Models (VLMs) are increasingly used to evaluate the outputs of other models, both on image-to-text (I2T) tasks such as visual question answering and on text-to-image (T2I) generation tasks. D…