

ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

February 1, 2025
作者: Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Bo Li, Xuming Hu, Xiaowen Chu
cs.AI

Abstract

To reduce memory costs in long-context inference with Large Language Models (LLMs), many recent works focus on compressing the key-value (KV) cache of different tokens. However, we identify that previous KV cache compression methods measure token importance individually, neglecting the dependencies between tokens that characterize real-world language. In light of this, we introduce ChunkKV, which groups the tokens in a chunk as a basic compression unit, retaining the most informative semantic chunks while discarding the less important ones. Furthermore, observing that ChunkKV exhibits higher similarity in the preserved indices across different layers, we propose layer-wise index reuse to further reduce computational overhead. We evaluated ChunkKV on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K and JailbreakV in-context learning benchmarks. Our experiments with instruction-tuned and multi-step reasoning (O1 and R1) LLMs show up to a 10% performance improvement under aggressive compression ratios compared to existing methods.
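
The abstract describes two mechanisms: retaining the KV cache at the granularity of semantic chunks rather than individual tokens, and reusing the preserved indices across layers. Below is a minimal PyTorch sketch of the chunk-selection step under assumed details (total attention mass as the chunk score, a fixed chunk size, and a fixed retention ratio); the function names and scoring rule are illustrative and not the paper's reference implementation.

```python
# Minimal sketch of chunk-based KV cache compression in the spirit of ChunkKV.
# Assumptions (not from the paper's reference implementation): chunk importance
# is the total attention mass received by the chunk's tokens, chunks are
# fixed-size, and a fixed fraction of chunks is retained.
import torch
import torch.nn.functional as F


def chunk_scores(attn_weights: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Score fixed-size chunks of key positions by received attention mass.

    attn_weights: [num_heads, q_len, kv_len] softmax attention weights.
    Returns a [num_chunks] vector; higher means a more informative chunk.
    """
    kv_len = attn_weights.shape[-1]
    token_scores = attn_weights.sum(dim=(0, 1))            # [kv_len]
    pad = (-kv_len) % chunk_size                           # pad to a multiple
    token_scores = F.pad(token_scores, (0, pad))
    return token_scores.view(-1, chunk_size).sum(dim=-1)   # [num_chunks]


def compress_kv(keys, values, attn_weights, chunk_size=16, keep_ratio=0.25):
    """Retain whole chunks of the KV cache instead of individual tokens.

    keys/values: [num_heads, kv_len, head_dim].
    Returns compressed keys/values and the kept token indices, so that later
    layers can reuse the same indices.
    """
    kv_len = keys.shape[1]
    scores = chunk_scores(attn_weights, chunk_size)
    num_keep = max(1, int(scores.numel() * keep_ratio))
    top_chunks = scores.topk(num_keep).indices             # chunk ids

    # Expand chunk ids to token indices, clip to the true sequence length.
    offsets = torch.arange(chunk_size, device=keys.device)
    token_idx = (top_chunks[:, None] * chunk_size + offsets).flatten()
    token_idx = token_idx[token_idx < kv_len].sort().values

    return keys[:, token_idx], values[:, token_idx], token_idx
```

Layer-wise index reuse, as described in the abstract, would then correspond to computing `token_idx` at one layer and applying it to the KV caches of the following layers rather than re-scoring at every layer, which is where the additional reduction in computational overhead comes from.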
