Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
arXiv:2604.08014v3 Announce Type: replace
Abstract: Spatio-Temporal Video Grounding requires jointly localizing target objects across both temporal and spatial dimensions based on natural language queries, posing fundamental challenges for existing Mu…