InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding

Abstract

Modern multimodal large language models (MLLMs) can reason over hour-long video, yet their key-value (KV) cache grows linearly with time, quickly exceeding the fixed memory of phones, AR glasses, and edge robots. Prior compression schemes either assume the whole video and user query are available offline or must first build the full cache, so memory still scales with stream length. InfiniPot-V is the first training-free, query-agnostic framework that enforces a hard, length-independent memory cap for streaming video understanding. During video encoding it monitors the cache and, once a user-set threshold is reached, runs a lightweight compression pass that (i) removes temporally redundant tokens via a Temporal-axis Redundancy (TaR) metric and (ii) keeps semantically significant tokens via Value-Norm (VaN) ranking. Across four open-source MLLMs, four long-video benchmarks, and two streaming-video benchmarks, InfiniPot-V cuts peak GPU memory by up to 94%, sustains real-time generation, and matches or surpasses full-cache accuracy, even in multi-turn dialogues. By dissolving the KV cache bottleneck without retraining or query knowledge, InfiniPot-V closes the gap for on-device streaming video assistants.
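
The threshold-triggered compression loop described in the abstract can be illustrated with a small sketch. The abstract does not give the TaR or VaN formulas, so this is not the authors' implementation: as stand-ins, cosine similarity between a token's key and the mean key of the most recent tokens serves as a proxy for temporal redundancy, and the L2 norm of the value vector serves as the value-norm score. The names `compress`, `stream_encode`, `threshold`, `target_len`, and `recent_len`, the even budget split between the two criteria, and the choice to always keep the most recent tokens are all assumptions made for illustration.

```python
import math

def norm(vec):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def compress(cache, target_len, recent_len):
    """Shrink `cache`, a time-ordered list of (key, value) vector pairs,
    to at most `target_len` entries. Illustrative proxy only."""
    if not 1 <= recent_len <= target_len:
        raise ValueError("need 1 <= recent_len <= target_len")
    if len(cache) <= target_len:
        return list(cache)
    split = len(cache) - recent_len
    old, recent = cache[:split], cache[split:]  # recent tokens are always kept
    budget = target_len - recent_len            # slots left for older tokens
    dim = len(recent[0][0])
    mean_key = [sum(k[i] for k, _ in recent) / len(recent) for i in range(dim)]

    # (i) Assumed TaR proxy: older tokens whose keys are least similar to the
    # recent keys are treated as least temporally redundant and kept first.
    by_novelty = sorted(range(len(old)), key=lambda i: cosine(old[i][0], mean_key))
    keep = set(by_novelty[: budget // 2])

    # (ii) Assumed VaN proxy: fill the remaining slots with the tokens that
    # have the largest value-vector norm.
    rest = [i for i in range(len(old)) if i not in keep]
    rest.sort(key=lambda i: norm(old[i][1]), reverse=True)
    keep.update(rest[: budget - len(keep)])

    # Restore temporal order before appending the recent window.
    return [old[i] for i in sorted(keep)] + recent

def stream_encode(frames, threshold, target_len, recent_len):
    """Append each frame's tokens to the cache and compress whenever the
    cache length reaches `threshold`. Returns (final cache, peak length)."""
    if target_len >= threshold:
        raise ValueError("target_len must be smaller than threshold")
    cache, peak = [], 0
    for frame_tokens in frames:
        cache.extend(frame_tokens)
        peak = max(peak, len(cache))
        if len(cache) >= threshold:
            cache = compress(cache, target_len, recent_len)
    return cache, peak
```

In this sketch the cache never holds more than `threshold - 1` entries plus one frame's tokens, no matter how many frames are streamed, which is the length-independent bound the abstract refers to. No user query appears anywhere in the scoring, mirroring the query-agnostic setting.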
