VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
arXiv:2604.05418v2 Announce Type: replace
Abstract: Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevan…