FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches focused primarily on reducing memory demands and offered only limited latency improvements. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context sequences. To increase processing speed while maintaining accuracy, FastKV adopts a novel Token-Selective Propagation (TSP) approach that retains the full context information in the initial layers of LLMs and selectively propagates only a portion of this information in deeper layers, even in the prefill stage. Additionally, FastKV incorporates grouped-query attention (GQA)-aware KV cache compression to exploit the advantages of GQA in both memory and computational efficiency. Our experimental results show that FastKV achieves 2.00× and 1.40× improvements in time-to-first-token (TTFT) and throughput, respectively, compared to HeadKV, the state-of-the-art KV cache compression method. Moreover, FastKV maintains accuracy on long-context benchmarks at levels comparable to the baselines. Our code is available at https://github.com/dongwonjo/FastKV.
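
The two ideas named above, propagating only a subset of tokens to deeper layers and choosing KV entries per GQA group rather than per query head, can be illustrated with a minimal sketch. The abstract does not say how tokens are scored or how many are kept, so everything below is an assumption made for illustration: the per-head importance scores are taken as given, and the function names `group_scores`, `select_tokens`, and `propagate` are hypothetical and not taken from the FastKV code base.

```python
from typing import List


def group_scores(head_scores: List[List[float]], num_kv_heads: int) -> List[List[float]]:
    """Average per-query-head token scores within each GQA group.

    head_scores[h][t] is an assumed importance score of token t for query
    head h. Query heads that share one KV head are averaged, so each KV
    head ends up with a single score per token.
    """
    num_q_heads = len(head_scores)
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q_heads // num_kv_heads
    grouped = []
    for g in range(num_kv_heads):
        heads = head_scores[g * group_size:(g + 1) * group_size]
        grouped.append([sum(col) / group_size for col in zip(*heads)])
    return grouped


def select_tokens(scores: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring tokens, in sequence order."""
    keep = max(0, min(keep, len(scores)))
    top = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:keep]
    return sorted(top)


def propagate(hidden: List[List[float]], kept: List[int]) -> List[List[float]]:
    """Keep only the hidden states of the selected tokens for deeper layers."""
    return [hidden[t] for t in kept]
```

In this sketch, `select_tokens` applied to each row of `group_scores(...)` gives one retained index set per KV head, so every query head in a group reads the same compressed cache. Applying `select_tokens` to a single score vector aggregated over all heads, followed by `propagate`, gives the shorter sequence that later layers would process. The aggregation rule (a plain mean) and the top-k selection are stand-ins; the paper's actual criteria may differ.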