cs.CV

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

arXiv:2604.18573v1 Announce Type: new
Abstract: Despite recent progress, vision-language encoders suffer from two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks such as open-vocabulary semantic segmen…