FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
arXiv:2504.09925v3 Announce Type: replace
Abstract: We introduce FLARE, a family of vision-language models (VLMs) built on a paradigm of full vision-language alignment and integration. Unlike existing approaches that rely on single MLP projectors for modali…