Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu / May 14, 2026

arXiv:2602.02977v2 Announce Type: replace-cross
Abstract: Vision-language models such as CLIP often struggle to faithfully understand long, detail-rich captions, relying on dominant scene cues while overlooking fine-grained visual evidence. We propose…

Author name: Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding