Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
arXiv:2604.17422v1 Announce Type: new
Abstract: Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, w…