Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
arXiv:2605.12034v2 Announce Type: replace-cross
Abstract: Omni-modal language models are intended to jointly understand audio, visual, and language inputs, but benchmark gains can be inflated when visual evidence alone suffices to answer a query. We …