Ami Baid, Zihui Xue, Kristen Grauman

Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

Ami Baid, Zihui Xue, Kristen Grauman / April 16, 2026

arXiv:2604.14129v1
Abstract: While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is v…
