Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

arXiv:2604.05834v2 Announce Type: replace Abstract: Contrastive learning has become a standard approach for unsupervised learning from paired data, as demonstrated by CLIP for image-text matching. However, many domains involve more than two modalities and require objectives that capture higher-order dependencies beyond pairwise alignment. Symile extends CLIP to this setting by replacing the dot product with the multilinear inner product (MIP) over modality embeddings. In this work, we show that a fragility hides in the multiplicative interaction: a single weakly informative, misaligned, or missing modality can propagate through the objective and distort cross-modal retrieval scores. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions, with an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that exposes this fragility and three real-world trimodal datasets, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned state-of-the-art baselines. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning beyond two modalities in the presence of noise, misalignment, or missing inputs.
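To make the two core operations in the abstract concrete, here is a minimal sketch of the multilinear inner product (an elementwise product of three embeddings summed over dimensions, which reduces to the dot product with two modalities) and of gating as interpolation toward a neutral direction. Function names and the scalar gate are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mip(x, y, z):
    """Multilinear inner product over three modality embeddings:
    elementwise product, summed over the embedding dimension."""
    return np.sum(x * y * z)

def gated_embedding(e, neutral, gate):
    """Interpolate an embedding toward a learnable neutral direction.
    gate in [0, 1]: 1 keeps the embedding, 0 replaces it with neutral
    (the 'NULL' option for an unreliable modality)."""
    return gate * e + (1.0 - gate) * neutral

# A near-zero (weakly informative) third modality drags the MIP toward
# zero for every candidate; gating it to a neutral all-ones direction
# restores the pairwise signal between the other two modalities.
x = np.array([1.0, -1.0, 0.5])
y = np.array([1.0, 1.0, 2.0])
z_noisy = np.array([1e-6, 1e-6, 1e-6])
neutral = np.ones(3)
score_raw = mip(x, y, z_noisy)                        # collapses toward 0
score_gated = mip(x, y, gated_embedding(z_noisy, neutral, 0.0))
```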
