Duy Le Dinh Anh, Patrick Amadeus Irawan, Tuan Van Vo

Counting to Four is still a Chore for VLMs

April 14, 2026

arXiv:2604.10039v1
Abstract: Vision–language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mos…
