Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
arXiv:2603.23404v2 Announce Type: replace-cross
Abstract: Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge…