Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
arXiv:2512.04000v2
Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video t…