VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
arXiv:2605.12325v2
Abstract: Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of e…