PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
arXiv:2603.25398v1 Announce Type: new
Abstract: Vision Foundation Models (VFMs) pre-trained at scale enable a single frozen encoder to serve multiple downstream tasks simultaneously. Recent VFM-based encoder-only models for image and video segmentatio…