

H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

June 24, 2023
Authors: Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
cs.AI

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
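
To make the eviction policy concrete, below is a minimal NumPy sketch of a greedy H_2O-style cache-retention step, not the authors' implementation. The function name h2o_evict, the fixed budget, and the recent_window split are illustrative assumptions: the cache always keeps the most recent positions and fills the remaining budget with the older positions that have accumulated the most attention mass, evicting the rest.

```python
import numpy as np

def h2o_evict(attn_scores_sum, seq_len, budget, recent_window):
    """
    Select which KV-cache positions to keep under a fixed budget.

    attn_scores_sum: accumulated attention each cached position has
                     received so far (a proxy for heavy-hitter status).
    seq_len:         number of positions currently in the cache.
    budget:          total number of positions the cache may hold.
    recent_window:   number of most recent positions that are always kept.
    """
    assert recent_window < budget, "budget must leave room for heavy hitters"
    if seq_len <= budget:
        return np.arange(seq_len)  # nothing to evict yet

    recent = np.arange(seq_len - recent_window, seq_len)   # always keep recent tokens
    candidates = np.arange(seq_len - recent_window)          # older tokens compete by score
    n_heavy = budget - recent_window
    # keep the older positions with the largest accumulated attention (the heavy hitters)
    heavy = candidates[np.argsort(attn_scores_sum[candidates])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

# Toy usage: 10 cached tokens, budget of 6, always keep the 3 most recent.
scores = np.array([5.0, 0.1, 3.2, 0.05, 0.2, 4.1, 0.3, 0.2, 0.1, 0.4])
keep = h2o_evict(scores, seq_len=10, budget=6, recent_window=3)
print(keep)  # positions retained: heavy hitters 0, 2, 5 plus recent 7, 8, 9
```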