cs.CV

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

arXiv:2602.22419v2 Announce Type: replace
Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image …
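As context for the zero-shot classification use mentioned in the abstract, the sketch below shows how CLIP-style features are typically applied: L2-normalize an image embedding and a set of class-prompt text embeddings, take temperature-scaled cosine similarities, and softmax over classes. The embeddings here are random placeholders standing in for real CLIP encoder outputs, and `temperature` is an illustrative value, not the paper's setting.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Score one image against class-prompt embeddings, CLIP-style.

    image_emb: (d,) image feature vector.
    text_embs: (k, d) array, one embedding per class prompt
               (e.g. "a photo of a cat").
    Returns softmax probabilities over the k classes.
    """
    # CLIP compares L2-normalized features via cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature      # scaled cosine similarities
    logits -= logits.max()                # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Random placeholders stand in for real encoder outputs.
rng = np.random.default_rng(0)
d = 512
classes = ["cat", "dog", "car"]
image_emb = rng.normal(size=d)
text_embs = rng.normal(size=(len(classes), d))
probs = zero_shot_classify(image_emb, text_embs)
print(classes[int(np.argmax(probs))], probs)
```

The predicted class is simply the prompt whose embedding is most similar to the image embedding; with real CLIP features this works without any task-specific training.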