cs.CV

GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding

arXiv:2604.02093v1 Announce Type: new
Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely…