Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

arXiv:2505.03821v2

Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a new set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations -- such as object position relative to the minifigure and the minifigure's orientation -- and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. We evaluate several high-performing models, including Gemini Robotics-ER 1.5, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, GPT-4, and Qwen3, and find that while they excel at scene understanding, performance declines markedly on spatial reasoning and deteriorates further on perspective taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.
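The abstract describes the benchmark as a systematic cross of object position, minifigure orientation, and viewpoint yielding 144 tasks, each with 7 questions spanning three cognition levels. The sketch below illustrates one way such a task grid could be enumerated. The specific factor levels, position labels, orientation steps, and per-level question counts are hypothetical assumptions chosen only so the totals match the figures reported in the abstract; the paper's actual design may differ.

```python
# Hypothetical sketch of enumerating the 144 task configurations.
# Assumed factorization: 8 positions x 9 orientations x 2 views = 144.
from dataclasses import dataclass
from itertools import product

POSITIONS = ["front", "behind", "left", "right",
             "front-left", "front-right", "behind-left", "behind-right"]  # 8 (assumed)
ORIENTATIONS = list(range(0, 360, 40))                                    # 9 angles in degrees (assumed)
VIEWS = ["birds_eye", "surface_level"]                                    # 2, per the abstract

# Three diagnostic levels named in the abstract; the split of the
# 7 questions across levels is an assumption.
QUESTION_LEVELS = {
    "scene_understanding": 2,
    "spatial_reasoning": 2,
    "visual_perspective_taking": 3,
}

@dataclass(frozen=True)
class Task:
    object_position: str
    minifigure_orientation_deg: int
    view: str

# Full cross of the assumed factors.
tasks = [Task(p, o, v) for p, o, v in product(POSITIONS, ORIENTATIONS, VIEWS)]
assert len(tasks) == 144  # matches the task count reported in the abstract

total_items = len(tasks) * sum(QUESTION_LEVELS.values())
print(f"{len(tasks)} tasks x 7 questions = {total_items} evaluation items")
```

Under these assumptions, scoring a model would amount to averaging accuracy per question level across all 144 tasks, which is how the abstract's level-by-level comparison (scene understanding vs. spatial reasoning vs. perspective taking) could be computed.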
