Savya Khosla, Sethuraman T V, Aryan Chadha, Alex Schwing, Derek Hoiem

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

April 21, 2026

arXiv:2604.18573v1 Announce Type: new
Abstract: Despite recent progress, vision-language encoders struggle with two core limitations: (1) weak alignment between language and dense vision features, which hurts tasks like open-vocabulary semantic segmen…
