Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
arXiv:2604.01002v1 Announce Type: cross
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational …