cs.CV, cs.LG

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

arXiv:2604.12335v1 Announce Type: new
Abstract: Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However…