cs.AI, cs.CV

Make Geometry Matter for Spatial Reasoning

arXiv:2603.26639v1 Announce Type: new
Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos rema…