LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
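The two kinds of sparsity named in the abstract can be illustrated with a small sketch. It is not the LServe implementation. The abstract does not specify the mask shape or the scoring function, so the following are assumptions made here: a streaming head is modeled as attending only to a few initial "sink" blocks plus a local window of recent blocks, and page importance is estimated with a per-channel min/max upper bound on the query-key dot product. The selection below is flat rather than hierarchical, and all function and parameter names (`streaming_block_mask`, `page_upper_bound`, `select_pages`, `budget`) are hypothetical.

```python
from typing import List, Sequence


def streaming_block_mask(num_blocks: int, sink_blocks: int,
                         local_blocks: int) -> List[List[bool]]:
    """Static block-level mask for a streaming head.

    Assumed shape: sink blocks plus a local window, under a causal constraint.
    mask[i][j] is True when query block i computes attention on key block j;
    every False entry is a block whose computation is skipped entirely.
    """
    mask = []
    for i in range(num_blocks):
        row = []
        for j in range(num_blocks):
            causal = j <= i                  # never attend to future blocks
            is_sink = j < sink_blocks        # first few blocks are always kept
            is_local = i - j < local_blocks  # most recent blocks are kept
            row.append(causal and (is_sink or is_local))
        mask.append(row)
    return mask


def page_upper_bound(query: Sequence[float],
                     page_keys: Sequence[Sequence[float]]) -> float:
    """Upper bound on q.k over all keys in one page (assumed scoring rule).

    For each channel, the largest possible contribution is reached at either
    the per-channel minimum or maximum of the keys stored in the page.
    """
    bound = 0.0
    for c, q in enumerate(query):
        k_min = min(k[c] for k in page_keys)
        k_max = max(k[c] for k in page_keys)
        bound += max(q * k_min, q * k_max)
    return bound


def select_pages(query: Sequence[float], keys: Sequence[Sequence[float]],
                 page_size: int, budget: int) -> List[int]:
    """Dynamic sparsity: keep a fixed number of KV pages for this query."""
    pages = [keys[s:s + page_size] for s in range(0, len(keys), page_size)]
    scores = [page_upper_bound(query, p) for p in pages]
    ranked = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # page indices in positional order


# Usage: 6 blocks, 1 sink block, local window of 2 blocks.
mask = streaming_block_mask(num_blocks=6, sink_blocks=1, local_blocks=2)
print(mask[5])  # [True, False, False, False, True, True]

# Usage: 8 keys of dimension 2, grouped into 4 pages of 2 keys each.
keys = [[0.1, 0.0], [0.0, 0.2], [2.0, 1.0], [1.5, 0.5],
        [-1.0, 0.0], [0.0, -1.0], [0.3, 0.3], [0.2, 0.1]]
print(select_pages([1.0, 1.0], keys, page_size=2, budget=2))  # [1, 3]
```

In this sketch each row of the streaming mask keeps at most `sink_blocks + local_blocks` blocks, so the cost of such a head does not grow with context length. Likewise, `budget` is independent of the number of keys, which mirrors the abstract's observation that a constant number of KV pages suffices. Because the static mask applies to some heads and the page budget to the others, the two savings can be combined.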
