Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
arXiv:2603.01400v2 Announce Type: replace
Abstract: Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. Existing pruning methods primarily target intra-frame spatial redundancy…
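The intra-frame spatial pruning the abstract refers to can be illustrated with a minimal sketch: score each visual token by its mean cosine similarity to the other tokens in the frame and drop the most redundant ones. The function name, scoring rule, and keep ratio below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def prune_redundant_tokens(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Keep the tokens least similar to the rest (illustrative redundancy pruning).

    tokens: (N, D) array of visual token embeddings for one frame.
    Returns the kept tokens in their original order.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T                      # (N, N) pairwise similarity
    np.fill_diagonal(sim, 0.0)                   # ignore self-similarity
    redundancy = sim.mean(axis=1)                # high mean similarity = redundant
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(redundancy)[:n_keep])  # least redundant first
    return tokens[keep_idx]

# Example: 8 tokens of dimension 4, half of them near-duplicates of the others.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
tokens = np.concatenate([base, base + 1e-3 * rng.normal(size=(4, 4))])
kept = prune_redundant_tokens(tokens, keep_ratio=0.5)
print(kept.shape)  # (4, 4)
```

A method that also exploits temporal (inter-frame) redundancy, as the title's "local and global contexts" suggests, would additionally compare tokens across frames rather than within a single frame.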