Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
arXiv:2601.10611v4 Announce Type: replace
Abstract: Today’s strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not di…