Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

May 11, 2026

arXiv:2601.23251v2 Announce Type: replace
Abstract: State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a …
