Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning
arXiv:2603.26365v1 Announce Type: new
Abstract: Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from ”context rot” due to massive…