cs.AI, cs.CV, cs.LG

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

arXiv:2602.02977v2 Announce Type: replace-cross
Abstract: Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose…