See, Symbolize, Act: Grounding VLMs with Spatial Representations for Better Gameplay
arXiv:2603.11601v2 Announce Type: replace
Abstract: Vision-Language Models (VLMs) excel at describing visual scenes, yet struggle to translate perception into precise, grounded actions. We investigate whether providing VLMs with both the visual frame …