The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
arXiv:2412.08110v3 Announce Type: replace-cross
Abstract: Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object…