Jihwan Hong, Jaeyoung Do

VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

Jihwan Hong, Jaeyoung Do / March 31, 2026

arXiv:2603.27060v1
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model wi…
