Multimodal Representation Learning Conditioned on Semantic Relations
arXiv:2508.17497v2 Announce Type: replace-cross
Abstract: Multimodal representation learning has been largely driven by contrastive models such as CLIP, which learn a shared embedding space by aligning paired image-text samples. While effective for ge…
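The abstract's description of CLIP — learning a shared embedding space by aligning paired image-text samples — refers to a symmetric contrastive (InfoNCE-style) objective, where each image is pulled toward its paired caption and pushed away from the other captions in the batch, and vice versa. A minimal NumPy sketch of that objective, under the usual assumptions (L2-normalized embeddings, matched pairs on the diagonal, an illustrative temperature value):

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Row i of img_emb is assumed to be paired with row i of txt_emb.
    """
    # Normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise similarity matrix
    n = logits.shape[0]
    idx = np.arange(n)
    # Cross-entropy in both directions; matched pairs sit on the diagonal.
    i2t = -log_softmax(logits)[idx, idx]    # image -> text
    t2i = -log_softmax(logits.T)[idx, idx]  # text -> image
    return (i2t.mean() + t2i.mean()) / 2
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs are penalized; function names and the temperature are illustrative, not the paper's code.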