cs.AI, cs.CV

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

arXiv:2604.22498v1 Announce Type: cross
Abstract: Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention …