How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning
arXiv:2603.24866v1 Announce Type: cross
Abstract: The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual …