Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
arXiv:2603.17980v2 Announce Type: replace
Abstract: Recent Multimodal Large Language Models (MLLMs) have shown strong potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations such as poi…