cs.CV, cs.MM

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

arXiv:2604.11102v1 Announce Type: new
Abstract: Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded…