OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
arXiv:2604.11102v1 Announce Type: new
Abstract: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded…