G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
arXiv:2605.12309v1 Announce Type: new
Abstract: The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual…