cs.CV

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

arXiv:2604.02323v1 Announce Type: new
Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explor…