cs.CV

Counting to Four is still a Chore for VLMs

arXiv:2604.10039v1 Announce Type: new
Abstract: Vision–language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mos…