cs.CL, cs.CV

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning

arXiv:2603.23404v2 Announce Type: replace-cross
Abstract: Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge…