Referring Multiple Regions with Large Multimodal Models via Contextual Latent Steering
arXiv:2605.01827v1 Announce Type: new
Abstract: Large Multimodal Models (LMMs) have recently demonstrated their proficiency in holistic visual comprehension. However, most of them struggle to tackle region-level perception guided by visual prompts, es…