Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
arXiv:2604.11496v2 Announce Type: replace-cross
Abstract: Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks. We argue that this limitation …