cs.AI, cs.CV

Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

arXiv:2603.04676v2 Announce Type: replace
Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image…