Geometry-Guided 3D Visual Token Pruning for Video-Language Models
arXiv:2604.18260v1 Announce Type: new
Abstract: Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos comp…