SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding
arXiv:2603.25733v1 Announce Type: new
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal underst…