InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding
June 18, 2025
Authors: Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang
cs.AI
Abstract
Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs on four long-video and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
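
To make the cap-triggered compression pass concrete, here is a minimal PyTorch sketch under stated assumptions. The abstract names the TaR and VaN criteria but not their formulas, so this snippet stands in plausible proxies: TaR is approximated as cosine similarity between a token's key and the key at the same spatial slot in the previous frame, VaN as the L2 norm of each value vector, and the 50/50 budget split between the two stages is an illustrative choice. Every identifier (compress_kv_cache, CACHE_CAP, TOKENS_PER_FRAME) is hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

CACHE_CAP = 8192          # hypothetical user-set token budget per layer
TOKENS_PER_FRAME = 196    # hypothetical visual tokens per video frame

def compress_kv_cache(keys, values, budget, tokens_per_frame=TOKENS_PER_FRAME):
    """Shrink one layer's KV cache to `budget` tokens (illustrative sketch).

    keys, values: [num_tokens, head_dim] tensors laid out frame by frame.
    The scoring rules below are assumptions standing in for the paper's
    TaR and VaN definitions, not the authors' exact formulation.
    """
    n, _ = keys.shape
    num_frames = n // tokens_per_frame
    frames = keys[: num_frames * tokens_per_frame].view(
        num_frames, tokens_per_frame, -1
    )

    # (i) TaR stage (assumed form): a token is temporally redundant when its
    # key closely matches the key at the same spatial slot one frame earlier.
    sim = F.cosine_similarity(frames[1:], frames[:-1], dim=-1)    # [F-1, T]
    first = torch.zeros(1, tokens_per_frame, device=keys.device)  # frame 0 kept
    redundancy = torch.cat([first, sim], dim=0).flatten()         # [F*T]

    # Drop the most redundant half, but never go below the final budget
    # (the 50/50 split between the two stages is an illustrative choice).
    n_tar = min(redundancy.numel(), max(budget, redundancy.numel() // 2))
    tar_keep = torch.topk(redundancy, n_tar, largest=False).indices

    # (ii) VaN stage (assumed form): among survivors, rank tokens by the
    # L2 norm of their value vectors and retain the top `budget`.
    van_scores = values[tar_keep].norm(dim=-1)
    top = torch.topk(van_scores, min(budget, n_tar)).indices
    keep = tar_keep[top].sort().values  # restore temporal order

    return keys[keep], values[keep]

if __name__ == "__main__":
    # Toy demo: 100 frames of random KV entries, compressed once the cache
    # exceeds the cap, so peak memory never scales with stream length.
    n = 100 * TOKENS_PER_FRAME
    k, v = torch.randn(n, 64), torch.randn(n, 64)
    if k.shape[0] > CACHE_CAP:
        k, v = compress_kv_cache(k, v, budget=CACHE_CAP // 2)
    print(k.shape)  # torch.Size([4096, 64])
```

One design note on the sketch: sorting the surviving indices before gathering keeps the compressed cache in temporal order, which matters for position-dependent attention; how the actual method handles positions after compression is not specified in the abstract.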