H_2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to the model parameters, and it scales linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache that significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H_2). Through a comprehensive investigation, we find that (i) the emergence of H_2 is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H_2O), a KV cache eviction policy that dynamically retains a balance of recent and H_2 tokens. We formulate KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm, which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H_2O with 20% heavy hitters improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29×, 29×, and 3×, respectively, on OPT-6.7B and OPT-30B. With the same batch size, H_2O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
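To make the eviction idea concrete, the following is a minimal sketch of a policy that keeps a fixed number of recent tokens plus a fixed number of high-scoring tokens. It assumes that the heavy-hitter statistic is the attention probability each cached token has accumulated over decoding steps, and that eviction is greedy (drop the lowest-scoring token outside the recent window). The function name `h2o_evict_step`, its parameters, and the single-head dictionary representation are hypothetical choices for illustration; this is not the authors' implementation, which is in the repository linked above.

```python
from typing import Dict, List


def h2o_evict_step(
    acc_scores: Dict[int, float],
    attn_row: Dict[int, float],
    new_pos: int,
    heavy_budget: int,
    recent_budget: int,
) -> List[int]:
    """Update accumulated attention scores, then evict down to the budget.

    acc_scores: position -> accumulated attention score of each cached
                token (mutated in place; its keys are the cache contents).
    attn_row:   position -> attention probability from the current query.
    new_pos:    position of the token whose key/value was just cached.
    Returns the positions evicted at this step.
    """
    # The newly generated token enters the cache with a zero score.
    acc_scores.setdefault(new_pos, 0.0)

    # Accumulate the attention that each still-cached token received.
    for pos, prob in attn_row.items():
        if pos in acc_scores:
            acc_scores[pos] += prob

    evicted: List[int] = []
    budget = heavy_budget + recent_budget
    while len(acc_scores) > budget:
        # The most recent positions are always protected from eviction.
        recent = set(sorted(acc_scores)[-recent_budget:]) if recent_budget else set()
        candidates = [p for p in acc_scores if p not in recent]
        # Greedy choice: drop the non-recent token with the lowest score
        # (ties broken by the older position).
        victim = min(candidates, key=lambda p: (acc_scores[p], p))
        del acc_scores[victim]
        evicted.append(victim)
    return evicted
```

Called once per decoding step, this keeps the cache at no more than `heavy_budget + recent_budget` entries per head, so memory no longer grows with the generated length. In a real system the scores would be tracked per attention head and per layer, and the corresponding key and value tensors would be dropped alongside the score entries.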
