MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding
arXiv:2605.03398v1 Announce Type: new
Abstract: Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments results in i…