High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
arXiv:2507.05920v2 Announce Type: replace
Abstract: State-of-the-art large multi-modal models (LMMs) face challenges when processing high-resolution images, as these inputs are converted into an enormous number of visual tokens, many of which are irrelevant to the…