Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
arXiv:2507.00748v3 Announce Type: replace
Abstract: Multimodal Large Language Models (MLLMs) perform well on single-image visual grounding but struggle with real-world tasks that demand cross-image reasoning and multimodal instructions. To address th…