Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
arXiv:2604.14129v1 Announce Type: new
Abstract: While Audio-Visual Language Models (AVLMs) have made remarkable progress in recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is v…