Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
arXiv:2603.04676v2 Announce Type: replace
Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image…