URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

arXiv:2604.06728v2 Announce Type: replace-cross

Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, most still treat modalities as equally reliable. In real social media posts, however, text and images often differ in noise level and relevance, making deterministic fusion susceptible to noisy evidence and weakened incongruity cues. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework for robust MSD. URMF first injects visual evidence into textual representations through multi-head cross-attention, and then applies self-attention in the fused semantic space to enhance incongruity reasoning. It models textual, visual, and interaction-aware representations as learnable Gaussian posteriors to estimate modality-specific uncertainty. Based on the estimated uncertainty, URMF dynamically adjusts modality contributions during fusion to suppress unreliable evidence. We further optimize the model with a unified objective that combines information bottleneck regularization, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven contrastive learning. Experiments on the public MSD and MMSD2 benchmarks show that URMF outperforms representative unimodal, multimodal, and MLLM-based baselines. The results demonstrate that explicit uncertainty modeling can improve both accuracy and robustness in multimodal sarcasm detection.
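The abstract does not spell out URMF's exact fusion rule, but the idea of down-weighting modalities with high posterior variance can be illustrated with a minimal sketch. The snippet below assumes each modality's Gaussian posterior is already summarized as a (mean, variance) pair and uses normalized inverse-variance weighting as a stand-in for the paper's learned uncertainty-driven weights; the function name and weighting scheme are illustrative, not the authors' implementation.

```python
def uncertainty_weighted_fusion(posteriors):
    """Fuse modality representations, suppressing unreliable ones.

    posteriors: list of (mu, var) pairs, one per modality, where mu and
    var are equal-length lists giving the mean and per-dimension variance
    of that modality's Gaussian posterior.
    Returns (fused_mu, weights).

    NOTE: inverse-variance weighting is an illustrative assumption; the
    paper learns its uncertainty-based weights end to end.
    """
    # Modality-level uncertainty: average of the posterior variances.
    uncerts = [sum(var) / len(var) for _, var in posteriors]
    # Inverse-uncertainty weights, normalized to sum to 1, so a noisier
    # modality contributes less to the fused representation.
    inv = [1.0 / u for u in uncerts]
    total = sum(inv)
    weights = [v / total for v in inv]
    # Weighted sum of the posterior means along each dimension.
    dim = len(posteriors[0][0])
    fused = [sum(w * mu[d] for w, (mu, _) in zip(weights, posteriors))
             for d in range(dim)]
    return fused, weights
```

For example, with a confident text posterior (variance 0.1) and a noisy image posterior (variance 0.9), the weights come out 0.9 and 0.1, so the text evidence dominates the fused vector:

```python
text_post = ([1.0, 0.0], [0.1, 0.1])   # low variance: reliable
image_post = ([0.0, 1.0], [0.9, 0.9])  # high variance: noisy
fused, weights = uncertainty_weighted_fusion([text_post, image_post])
# weights → [0.9, 0.1]; fused → [0.9, 0.1]
```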
