cs.AI, cs.CV

PushupBench: Your VLM is not good at counting pushups

arXiv:2604.23407v1 Announce Type: cross
Abstract: Large vision-language models (VLMs) can recognize what happens in video but fail to count how many times. We introduce PushupBench, 446 long-form clips (avg. 36.7s) for evalu…