Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Abstract

Video large language models (VideoLLMs) can process long video inputs and support complex reasoning and analysis. However, video frames yield thousands of visual tokens, so the key-value (KV) cache substantially increases memory requirements and becomes a bottleneck for both inference speed and memory usage. KV cache quantization is a widely used way to address this problem. In this paper, we find that 2-bit KV quantization barely degrades VideoLLM performance, while the limits of KV cache quantization at even lower bit widths have not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method that compresses the KV cache to fewer than 2 bits. Specifically, (1) for the key cache, we propose a mixed-precision quantization strategy along the channel dimension, applying 2-bit quantization to anomalous channels and 1-bit quantization combined with the fast Fourier transform (FFT) to normal channels; (2) for the value cache, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, for a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized per channel rather than per token, the scheme proposed by prior KV cache quantization work for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
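To make the bit widths and the per-channel versus per-token distinction concrete, here is a minimal pure-Python sketch. It is not the VidKV implementation. The absolute-mean scale is a common choice for ternary quantization and is only assumed here. The 50/50 split between 2-bit and 1-bit channels is likewise an assumption chosen to reproduce the 1.5-bit average. The toy cache and all function names are hypothetical. The FFT step and the salient-token filtering are omitted, and the bit counts ignore the storage cost of scales.

```python
import math

# 1.58 bits is the information content of a ternary code {-1, 0, +1}.
TERNARY_BITS = math.log2(3)

def average_bits(frac_2bit):
    """Average bits per element when a fraction of channels uses 2 bits
    and the rest use 1 bit (scale/metadata overhead ignored)."""
    return 2 * frac_2bit + 1 * (1 - frac_2bit)

def ternary_quantize(vec):
    """Quantize a vector to {-1, 0, +1} times one shared scale and
    return the dequantized values. The absolute-mean scale is an
    assumption, not necessarily what VidKV uses."""
    scale = (sum(abs(x) for x in vec) / len(vec)) or 1.0
    codes = [max(-1, min(1, round(x / scale))) for x in vec]
    return [c * scale for c in codes]

def quantize_per_token(cache):
    """One scale per row (token); cache is a list of token rows."""
    return [ternary_quantize(row) for row in cache]

def quantize_per_channel(cache):
    """One scale per column (channel)."""
    cols = [ternary_quantize(list(col)) for col in zip(*cache)]
    return [list(row) for row in zip(*cols)]

def mse(a, b):
    """Mean squared error between two equally shaped nested lists."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Hypothetical toy cache: 4 tokens x 3 channels, with one channel whose
# magnitude dominates the others.
toy_cache = [
    [0.1, 8.0, -0.2],
    [-0.1, 7.5, 0.2],
    [0.2, 8.5, -0.1],
    [-0.2, 7.9, 0.1],
]

if __name__ == "__main__":
    print("ternary bits:", TERNARY_BITS)
    print("avg key bits at 50/50 split:", average_bits(0.5))
    print("per-token MSE:", mse(toy_cache, quantize_per_token(toy_cache)))
    print("per-channel MSE:", mse(toy_cache, quantize_per_channel(toy_cache)))
```

On this toy input the per-channel scheme has the lower reconstruction error, because a per-token scale is dominated by the one large-magnitude channel. This illustrates the intuition only and says nothing about the magnitude of the effect on real VideoLLM caches.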
