One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
arXiv:2604.14149v2 Announce Type: replace
Abstract: Long video understanding is inherently challenging for vision-language models (VLMs) because of the large number of frames involved. With each video frame typically expanding into tens or hundreds of toke…