Adaptive Greedy Frame Selection for Long Video Understanding
arXiv:2603.20180v2
Abstract: Large vision–language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Na…