cs.CV

Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

arXiv:2605.00444v1 Announce Type: new
Abstract: Multi-modal large language models (MLLMs) advance vision-language understanding but face inherent limitations on long-video tasks because their perception context budgets are bounded. Existing agentic methods mitig…