The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding
arXiv:2603.26589v1
Abstract: What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptu…