How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
arXiv:2604.08966v1 Announce Type: new
Abstract: While Multimodal Large Language Models (MLLMs) have advanced Video Temporal Grounding (VTG), existing methods often couple output paradigms with different backbones, datasets, and training protocols. Thi…