Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Abstract

Video large language models (VideoLLMs) have demonstrated the ability to process longer video inputs and to perform complex reasoning and analysis over them. However, because video frames produce thousands of visual tokens, the key-value (KV) cache can substantially increase memory requirements and become a bottleneck for both inference speed and memory usage. KV cache quantization is a widely used approach to this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs barely hurts model performance, while the limits of KV cache quantization at even lower bit widths have not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method that compresses the KV cache to fewer than 2 bits. Specifically, (1) for keys, we propose a mixed-precision quantization strategy along the channel dimension, applying 2-bit quantization to anomalous channels and 1-bit quantization combined with FFT to normal channels; (2) for values, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, achieving a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized in a per-channel fashion rather than the per-token fashion proposed by prior KV cache quantization work for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
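To make the bit widths in the abstract concrete, the following is a minimal sketch, not the VidKV implementation. It assumes that "1.58-bit" refers to ternary codes in {-1, 0, +1} (log2(3) ≈ 1.585 bits per entry) and that a "1.5-bit" average could arise from, for example, half of the key channels stored at 2 bits and half at 1 bit. The function names, the mean-absolute-value scale, and the 50/50 split are all hypothetical choices made for illustration; the FFT step and the salient-token filtering are not modeled.

```python
import math


def ternary_quantize_per_channel(values):
    """Quantize a [tokens][channels] matrix to codes in {-1, 0, +1}.

    One scale is computed per channel (column), i.e. per-channel rather
    than per-token quantization. The mean-absolute-value scale is an
    assumption made for this sketch, not the paper's stated choice.
    """
    num_tokens = len(values)
    num_channels = len(values[0])
    scales = []
    for c in range(num_channels):
        mean_abs = sum(abs(values[t][c]) for t in range(num_tokens)) / num_tokens
        # Guard against an all-zero channel to avoid division by zero.
        scales.append(mean_abs if mean_abs > 0 else 1.0)
    codes = [
        [max(-1, min(1, round(row[c] / scales[c]))) for c in range(num_channels)]
        for row in values
    ]
    return codes, scales


def dequantize(codes, scales):
    """Map ternary codes back to floats using the per-channel scales."""
    return [[code * scales[c] for c, code in enumerate(row)] for row in codes]


def average_bits(fraction_high, high_bits=2.0, low_bits=1.0):
    """Average bits per entry when a fraction of channels uses high_bits."""
    return fraction_high * high_bits + (1.0 - fraction_high) * low_bits


# Information content of one ternary code, in bits.
bits_per_ternary = math.log2(3)

# Toy value cache: 3 tokens x 2 channels with very different channel ranges.
toy_values = [[0.1, -4.0], [-0.1, 4.0], [0.3, 0.2]]
codes, scales = ternary_quantize_per_channel(toy_values)
reconstructed = dequantize(codes, scales)
```

Because each channel gets its own scale, the small-magnitude channel and the large-magnitude channel in `toy_values` are both mapped onto the full ternary range, which is the intuition behind preferring per-channel over per-token scales when channel magnitudes differ widely.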
