One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework
arXiv:2510.02898v4 Announce Type: replace
Abstract: Zero-shot captioners are recently proposed models that use common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proc…