cs.CV

VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

arXiv:2603.27060v1 Announce Type: new
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model wi…