cs.AI, cs.CV

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

arXiv:2602.18846v2 Announce Type: replace
Abstract: Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency …