Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
arXiv:2603.29252v1 Announce Type: new
Abstract: Long video understanding is a key challenge that hinders the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of visual memory mec…