InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
June 18, 2025
Authors: Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang
cs.AI
Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs and four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
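To make the two-stage pass concrete, here is a minimal NumPy sketch of one compression step triggered when the cache hits its cap. The function name `compress_kv_cache` and the concrete TaR scoring (cosine similarity between temporally adjacent tokens) are illustrative assumptions, not the paper's exact formulation; only the overall structure — drop temporally redundant tokens first, then keep the tokens with the largest value-vector norms — follows the abstract.

```python
import numpy as np

def compress_kv_cache(keys, values, budget, tar_keep=0.5):
    """One compression pass, run once the cache reaches the memory cap.

    keys, values: (T, d) arrays of cached video tokens in time order.
    budget: number of tokens to retain after both stages.
    tar_keep: fraction of tokens surviving the TaR stage (assumed knob).
    """
    T = keys.shape[0]

    # Stage 1 (TaR, sketched): score each token by cosine similarity to
    # its temporal predecessor; near-duplicate (redundant) tokens score
    # high and are dropped first.
    prev = np.roll(keys, 1, axis=0)
    cos = np.sum(keys * prev, axis=1) / (
        np.linalg.norm(keys, axis=1) * np.linalg.norm(prev, axis=1) + 1e-8)
    cos[0] = -1.0  # the first token has no predecessor; always keep it
    n_tar = max(budget, int(T * tar_keep))
    tar_idx = np.sort(np.argsort(cos)[:n_tar])  # least-redundant tokens

    # Stage 2 (VaN): rank the survivors by the L2 norm of their value
    # vectors and keep the top `budget` as semantically significant.
    van = np.linalg.norm(values[tar_idx], axis=1)
    keep = np.sort(tar_idx[np.argsort(-van)[:budget]])
    return keys[keep], values[keep]
```

Because the pass runs inside the encoding loop each time the threshold is hit, the retained token count never exceeds `budget`, giving the hard, length-independent memory cap the abstract describes.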