HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models
arXiv:2508.00553v3 Announce Type: replace
Abstract: Vision-Language Models (VLMs) encode images and videos into large numbers of tokens, which carry substantial redundancy and incur significant computation cost. While visual token pruning mitigates this issue, most existing m…