Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

March 30, 2026

arXiv:2603.26589v1
Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptu…
