cs.CV

Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

arXiv:2601.23251v2 Announce Type: replace
Abstract: State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a …