VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
arXiv:2510.08618v2
Abstract: Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundament…