MS-Mix: Sentiment-Guided Adaptive Augmentation for Multimodal Sentiment Analysis
arXiv:2510.11579v3 Announce Type: replace
Abstract: Multimodal Sentiment Analysis (MSA) integrates complementary features from text, video, and audio to achieve robust emotion understanding in human interactions. However, MSA models suffer from data scarcity and high annotation costs, which limit real-world deployment in social media analytics and human-computer interaction systems. Existing Mixup-based augmentation techniques, when naively applied to MSA, ignore emotional semantics across modalities and often produce semantically inconsistent samples and amplified label noise. To address these challenges, we propose MS-Mix, an adaptive, emotion-sensitive augmentation framework that automatically optimizes data quality in multimodal settings. Its key components are: (1) a sentiment-aware sample selection strategy that filters incompatible pairs via latent-space semantic similarity, preventing the mixing of contradictory emotions; (2) a sentiment-intensity-guided module with multi-head self-attention that dynamically computes modality-specific mixing ratios conditioned on emotional salience; (3) a sentiment alignment loss based on Kullback-Leibler divergence that aligns predicted sentiment distributions across modalities with ground-truth labels, improving discrimination and consistency. Extensive experiments on two public datasets with six state-of-the-art backbones confirm that MS-Mix consistently outperforms prior methods, significantly improving robustness and practical applicability for MSA. The source code is available at an anonymous link: https://anonymous.4open.science/r/MS-Mix-review-0C72.
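Two of the abstract's components lend themselves to a brief illustration: filtering candidate Mixup pairs by latent-space similarity, and a KL-divergence alignment loss between predicted and target sentiment distributions. The following is a minimal sketch of those two ideas only; the function names, the cosine-similarity measure, and the `threshold` hyperparameter are illustrative assumptions, not the paper's actual implementation (which also includes the attention-based mixing-ratio module).

```python
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two latent embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_compatible_pairs(embeddings, threshold=0.5):
    """Sentiment-aware selection sketch: keep only index pairs whose latent
    embeddings are similar enough, so contradictory emotions are not mixed.
    `threshold` is a hypothetical hyperparameter, not taken from the paper."""
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

def kl_alignment_loss(pred_probs, target_probs, eps=1e-8):
    """KL(target || pred), averaged over samples: a stand-in for the
    sentiment alignment loss described in the abstract."""
    p = np.clip(target_probs, eps, 1.0)
    q = np.clip(pred_probs, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))
```

With this sketch, two nearly parallel embeddings form an eligible Mixup pair while orthogonal ones are filtered out, and the loss vanishes when predicted and target sentiment distributions coincide.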