cs.CL, cs.CV

Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models

arXiv:2605.11959v1 Announce Type: cross
Abstract: Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent …