cs.CL, cs.CV

Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

arXiv:2603.27349v1 Announce Type: cross
Abstract: Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning: distinguishing captions that share the same words but differ in relational structure. We pr…