TrackMAE: Video Representation Learning via Track Mask and Predict
arXiv:2603.27268v1 Announce Type: new
Abstract: Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but it encodes motion information only implicitly, limiting the encoding of temporal dynamics in the le…