Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
arXiv:2605.11959v1
Abstract: Multimodal video summarization requires visual features that align semantically with language generation. Traditional approaches rely on CNN features trained for object classification, which represent …