HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
arXiv:2601.14724v3 Announce Type: replace
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved offline video understanding. However, extending these capabilities to streaming video inpu…