

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

January 21, 2026
Authors: Haowei Zhang, Shudong Yang, Jinlan Fu, See-Kiong Ng, Xipeng Qiu
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant improvement in offline video understanding. However, extending these capabilities to streaming video inputs remains challenging, as existing models struggle to simultaneously maintain stable understanding performance, real-time responses, and low GPU memory overhead. To address this challenge, we propose HERMES, a novel training-free architecture for real-time and accurate understanding of video streams. Based on a mechanistic investigation of attention, we conceptualize the KV cache as a hierarchical memory framework that encapsulates video information across multiple granularities. During inference, HERMES reuses a compact KV cache, enabling efficient streaming understanding under resource constraints. Notably, HERMES requires no auxiliary computation upon the arrival of user queries, thereby guaranteeing real-time responses for continuous video stream interactions and achieving 10× faster time-to-first-token (TTFT) than the prior state of the art. Even when reducing video tokens by up to 68% compared with uniform sampling, HERMES achieves comparable or superior accuracy across all benchmarks, with gains of up to 11.4% on streaming datasets.