cs.CV

SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

arXiv:2604.02252v1 Announce Type: new
Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding due to their fixed pre-training resolution and inherently coarse patch-level repr…