Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
arXiv:2604.11244v2 Announce Type: replace
Abstract: Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the …