cs.CL

Language-Conditioned Visual Grounding with CLIP Multilingual

arXiv:2605.09060v1 Announce Type: new
Abstract: Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text bran…