J. de Curtò, Mauro Liz, I. de Zarzà

Language-Conditioned Visual Grounding with CLIP Multilingual

J. de Curtò, Mauro Liz, I. de Zarzà / May 12, 2026

arXiv:2605.09060v1 Announce Type: new
Abstract: Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text bran…
