cs.AI, cs.CV

VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

arXiv:2508.06869v4 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computationa…