Amartya Bhattacharya

Inference-Time Structural Reasoning for Compositional Vision-Language Understanding

Amartya Bhattacharya / March 31, 2026

arXiv:2603.27349v1 Announce Type: cross
Abstract: Vision-language models (VLMs) excel at image-text retrieval yet persistently fail at compositional reasoning, distinguishing captions that share the same words but differ in relational structure. We pr…
