cs.CV

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

arXiv:2604.25809v1 Announce Type: new
Abstract: Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual …