Make Geometry Matter for Spatial Reasoning
arXiv:2603.26639v1 Announce Type: new
Abstract: Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos rema…