
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

January 21, 2026
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have brought significant improvements in offline video understanding. Extending these capabilities to streaming video inputs, however, remains challenging: existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time, accurate understanding of video streams. Based on a mechanistic investigation of attention, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computation upon the arrival of user queries, guaranteeing real-time responses for continuous video-stream interactions and achieving a 10× faster time-to-first-token (TTFT) than the prior state of the art. Even when reducing video tokens by up to 68% relative to uniform sampling, HERMES achieves superior or comparable accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.
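The abstract does not detail how HERMES constructs its hierarchical memory, but the core idea — a bounded, multi-granularity KV cache that is maintained during streaming and handed to the model as-is at query time — can be sketched. Below is a minimal, purely illustrative Python sketch; the name `HierarchicalKVMemory`, the two-tier layout, the capacities, and the stride-based subsampling are all assumptions standing in for the paper's attention-guided selection across granularities.

```python
from collections import deque


class HierarchicalKVMemory:
    """Toy two-tier memory over KV entries (hypothetical; see lead-in).

    New frame tokens enter a fine-grained buffer at full resolution; when it
    overflows, the oldest chunk is subsampled and demoted to a coarser tier,
    so total memory stays bounded regardless of stream length.
    """

    def __init__(self, fine_capacity=256, coarse_capacity=512, stride=4, chunk=32):
        self.fine = deque()        # recent tokens kept at full resolution
        self.coarse = deque()      # older tokens kept subsampled (coarse granularity)
        self.fine_capacity = fine_capacity
        self.coarse_capacity = coarse_capacity
        self.stride = stride       # keep 1 of every `stride` demoted entries
        self.chunk = chunk         # how many entries to demote at a time

    def append_frame(self, kv_entries):
        """Ingest the KV entries of one newly encoded frame."""
        self.fine.extend(kv_entries)
        while len(self.fine) > self.fine_capacity:
            self._demote_oldest_chunk()

    def _demote_oldest_chunk(self):
        """Subsample the oldest fine-grained chunk into the coarse tier."""
        n = min(self.chunk, len(self.fine))
        chunk = [self.fine.popleft() for _ in range(n)]
        self.coarse.extend(chunk[::self.stride])
        while len(self.coarse) > self.coarse_capacity:
            self.coarse.popleft()  # evict the very oldest memory entirely

    def context(self):
        """The compact KV cache reused directly when a query arrives."""
        return list(self.coarse) + list(self.fine)


# Usage: stream 1,000 frames of 16 tokens each; answering a query then
# reads the maintained cache directly, with no recomputation at query time.
mem = HierarchicalKVMemory()
for t in range(1000):
    mem.append_frame([f"frame{t}_tok{i}" for i in range(16)])
print(len(mem.context()))  # bounded by fine_capacity + coarse_capacity
```

Under this kind of scheme, the cache size is constant in the stream length, which is what would allow both the low GPU memory overhead and the fast TTFT the abstract reports: query-time work reduces to attending over an already-built compact cache.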