Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
arXiv:2509.19207v2 Announce Type: replace
Abstract: Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While the…