Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
arXiv:2604.24036v2
Abstract: While Multimodal Large Language Models (MLLMs) have enhanced grounding capabilities in general scenes, their robustness in crowded scenes remains underexplored. Crowded scenes entail visual challenge…