Token Warping Helps MLLMs Look from Nearby Viewpoints
arXiv:2604.02870v1 Announce Type: new
Abstract: Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fra…