ArrowGEV: Grounding Events in Video via Learning the Arrow of Time
arXiv:2601.06559v2 Announce Type: replace
Abstract: Grounding events in videos is a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly used for this task, existing approaches predominantly train m…