Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva / May 13, 2026

arXiv:2509.19207v2
Abstract: Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While the…
