cs.CV

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

arXiv:2604.11025v1 Announce Type: new
Abstract: Recent multimodal large language models (MLLMs) have begun to support Thinking with Images by invoking visual tools such as zooming and cropping during inference. Yet these systems remain brittle in fine…