cs.CL

When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

arXiv:2601.04609v2 Announce Type: replace
Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, the specificity of descriptions is often conflated with their l…