Hydragen: High-Throughput LLM Inference with Shared Prefixes

Abstract

Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end LLM throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a high batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
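The prefix-suffix decomposition described above can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes single-head attention, one decode query per sequence, and that the two partial results are recombined by rescaling with their softmax log-sum-exp values, a standard way to merge attention over disjoint key sets exactly (the abstract itself does not state the recombination rule). All function names here are hypothetical.

```python
import numpy as np


def attention_with_lse(q, k, v):
    """Softmax attention that also returns the log-sum-exp of the scores.

    q: (..., n_q, d), k and v: (..., n_kv, d).
    Returns out: (..., n_q, d) and lse: (..., n_q, 1).
    """
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)  # subtract the max for stability
    p = np.exp(scores - m)
    denom = p.sum(axis=-1, keepdims=True)
    return (p / denom) @ v, m + np.log(denom)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge attention outputs computed over two disjoint key sets.

    Assumption: each partial output is reweighted by its share of the
    full softmax denominator, which reproduces attention over the union.
    """
    m = np.maximum(lse_a, lse_b)
    w_a, w_b = np.exp(lse_a - m), np.exp(lse_b - m)
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)


def shared_prefix_decode_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """q: (batch, d); prefix_k/v: (n_prefix, d), stored once;
    suffix_k/v: (batch, n_suffix, d), one cache per sequence."""
    # Prefix: all queries form one (batch, d) matrix, so the shared KV
    # cache is read once and the product is matrix-matrix.
    out_p, lse_p = attention_with_lse(q, prefix_k, prefix_v)
    # Suffix: each sequence attends to its own keys (matrix-vector work).
    out_s, lse_s = attention_with_lse(q[:, None, :], suffix_k, suffix_v)
    return combine(out_p, lse_p, out_s[:, 0, :], lse_s[:, 0, :])


def naive_decode_attention(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """Baseline: replicate the prefix for every sequence, then attend."""
    batch = q.shape[0]
    k = np.concatenate(
        [np.broadcast_to(prefix_k, (batch,) + prefix_k.shape), suffix_k], axis=1
    )
    v = np.concatenate(
        [np.broadcast_to(prefix_v, (batch,) + prefix_v.shape), suffix_v], axis=1
    )
    out, _ = attention_with_lse(q[:, None, :], k, v)
    return out[:, 0, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, d, n_prefix, n_suffix = 4, 8, 32, 5
    q = rng.standard_normal((batch, d))
    pk, pv = rng.standard_normal((2, n_prefix, d))
    sk, sv = rng.standard_normal((2, batch, n_suffix, d))
    fast = shared_prefix_decode_attention(q, pk, pv, sk, sv)
    slow = naive_decode_attention(q, pk, pv, sk, sv)
    print("max abs difference:", np.abs(fast - slow).max())
```

In this sketch the prefix keys and values are touched once for the whole batch, while the baseline materializes a copy per sequence. That difference is the redundancy the abstract says Hydragen removes. The sketch only checks numerical equivalence and says nothing about the reported speedups.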
