

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

March 20, 2025
Authors: Keda Tao, Haoxuan You, Yang Sui, Can Qin, Huan Wang
cs.AI

Abstract

Video large language models (VideoLLMs) have demonstrated the capability to process longer video inputs and enable complex reasoning and analysis. However, due to the thousands of visual tokens from the video frames, the key-value (KV) cache can significantly increase memory requirements, becoming a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to address this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs barely hurts model performance, while the limit of KV cache quantization at even lower bit widths has not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method that compresses the KV cache to fewer than 2 bits. Specifically, (1) for keys, we propose a mixed-precision quantization strategy in the channel dimension, where we perform 2-bit quantization for anomalous channels and 1-bit quantization combined with FFT for normal channels; (2) for values, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, achieving a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized per channel rather than per token, as proposed by prior KV cache quantization work on LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
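To make the scheme concrete, the sketch below simulates the two quantizers the abstract describes on a single attention head's KV cache: per-channel mixed-precision key quantization (2 bits for anomalous channels, 1 bit for the rest) and per-channel 1.58-bit (ternary) value quantization that preserves selected salient tokens in full precision. This is a minimal illustration in plain PyTorch, not the VidKV implementation: the function names, the range-based anomalous-channel criterion, the `anomaly_ratio` parameter, and the min-max rounding scheme are assumptions, and the FFT applied to normal key channels as well as the salient-token selection rule are omitted.

```python
# Hypothetical sketch of sub-2-bit KV cache quantization as described in the
# abstract. Names and thresholds are illustrative, not from the VidKV code.
import torch


def _uniform_quant(x: torch.Tensor, n_bits: int, dim: int = 0) -> torch.Tensor:
    """Uniform min-max quantization along `dim` (per-channel when dim indexes tokens)."""
    qmax = 2 ** n_bits - 1
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-6) / qmax
    q = torch.round((x - x_min) / scale).clamp(0, qmax)
    return q * scale + x_min  # return dequantized values (simulated quantization)


def quantize_key_cache(keys: torch.Tensor, anomaly_ratio: float = 0.2) -> torch.Tensor:
    """Mixed-precision per-channel key quantization; `keys` has shape [tokens, channels].

    Channels with the largest value range are treated as 'anomalous' and kept at
    2 bits; the remaining 'normal' channels are quantized to 1 bit. (The paper
    additionally applies an FFT to normal channels before 1-bit quantization;
    that step is omitted here.)
    """
    channel_range = keys.amax(dim=0) - keys.amin(dim=0)
    n_anom = max(1, int(anomaly_ratio * keys.shape[1]))
    anom_idx = channel_range.topk(n_anom).indices
    out = _uniform_quant(keys, n_bits=1, dim=0)                             # 1-bit normal channels
    out[:, anom_idx] = _uniform_quant(keys[:, anom_idx], n_bits=2, dim=0)   # 2-bit anomalous channels
    return out


def quantize_value_cache(values: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """1.58-bit (ternary {-1, 0, +1} levels) per-channel value quantization.

    `keep_mask` marks semantically salient visual tokens kept in full precision;
    how those tokens are selected is not shown here.
    """
    scale = values.abs().amax(dim=0, keepdim=True).clamp(min=1e-6)  # per-channel scale
    ternary = torch.round((values / scale).clamp(-1, 1)) * scale
    return torch.where(keep_mask.unsqueeze(-1), values, ternary)


if __name__ == "__main__":
    k = torch.randn(1024, 128)             # [num_tokens, head_dim]
    v = torch.randn(1024, 128)
    keep = torch.zeros(1024, dtype=torch.bool)
    keep[:64] = True                        # pretend the first 64 tokens are salient
    print(quantize_key_cache(k).shape, quantize_value_cache(v, keep).shape)
```

Note that both quantizers compute statistics over the token dimension, i.e. per channel, matching the paper's finding that the value cache should be quantized per channel rather than per token.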
