Multimodal Representation Learning Conditioned on Semantic Relations
arXiv:2508.17497v2 Announce Type: replace-cross
Abstract: Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for general-purpose representation learning, such models typically produce a single embedding per sample that is reused across different semantic relations and contexts. However, in many real-world applications, relevance between samples is inherently relation-dependent, with different semantic relations emphasizing different aspects of multimodal data.
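For context, the following is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style models use to align paired image-text samples; the function and variable names and the temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    # Pairwise similarity logits, scaled by temperature.
    logits = img @ txt.t() / temperature
    # Matched image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

Because each sample contributes one fixed embedding to this objective, the learned representation is relation-agnostic, which is the limitation the abstract identifies.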
In this work, we propose Relation-Conditioned Multimodal Learning (RCML), a framework that treats semantic relations as explicit conditioning signals for multimodal representation learning. Rather than producing relation-agnostic embeddings, RCML learns representations conditioned on natural-language relation descriptions, so that the same sample can be represented differently under different relational contexts. The framework constructs relation-aware training pairs, introduces a relation-conditioned module that adapts embeddings to relation semantics, and employs a unified contrastive objective that jointly models cross-modal alignment and relation-induced inter-sample structure.
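The abstract does not specify how the relation-conditioned module works, so the sketch below is only a hypothetical reading, assuming a FiLM-style scale-and-shift modulation of a base embedding by an embedding of the natural-language relation description; the class name, shapes, and the choice of modulation are all assumptions.

```python
import torch
import torch.nn as nn

class RelationConditionedHead(nn.Module):
    """Adapts a base sample embedding to a given semantic relation.

    `relation_emb` is assumed to come from a text encoder applied to the
    relation description (e.g., "depicts the same event"); this is an
    illustrative assumption, not the paper's stated design.
    """
    def __init__(self, dim: int):
        super().__init__()
        # Predict a per-dimension scale and shift from the relation embedding.
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, sample_emb: torch.Tensor,
                relation_emb: torch.Tensor) -> torch.Tensor:
        scale = self.to_scale(relation_emb)
        shift = self.to_shift(relation_emb)
        # The same sample embedding, re-expressed under the given relation.
        return sample_emb * (1.0 + scale) + shift
```

Under this reading, feeding the same sample embedding through the head with different relation descriptions yields different representations, which matches the relation-dependent behavior the abstract describes.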
Experiments on multiple datasets show that RCML consistently outperforms strong baselines on retrieval and classification tasks in zero-shot, fine-tuned, and out-of-domain settings, highlighting the effectiveness of leveraging semantic relations to guide multimodal representation learning.