cs.AI, cs.CV, cs.MM, cs.SD

Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

arXiv:2604.04229v1 Announce Type: cross
Abstract: Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurre…