cs.CL, cs.CV, cs.LG

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

arXiv:2412.08110v3 Announce Type: replace-cross
Abstract: Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object…