MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse patterns) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency of the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference reduces pre-filling latency on an A100 by up to 10x while maintaining accuracy. Our code is available at https://aka.ms/MInference.
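To make the three pattern names more concrete, the following toy sketch builds small causal boolean masks and reports how many attention entries each one keeps. It is an illustration under stated assumptions, not the MInference implementation: the abstract names the patterns but does not define them, so the shapes used here (initial tokens plus a local window for A-shape, selected key columns plus selected diagonals for Vertical-Slash, selected tiles for Block-Sparse) are an assumed reading of those names. All function names and parameters are hypothetical. The actual method uses optimized GPU kernels and builds its indices dynamically at inference time, neither of which is modelled here.

```python
# Illustrative only: toy causal masks for the three pattern names in the abstract.
# Every function name and parameter below is hypothetical, not MInference's API.

def a_shape_mask(n, n_init, window):
    """Assumed A-shape: each query sees the first n_init keys plus a local window."""
    return [[k <= q and (k < n_init or q - k < window) for k in range(n)]
            for q in range(n)]

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Assumed Vertical-Slash: chosen key columns plus chosen diagonals (q - k)."""
    cols, offs = set(vertical_cols), set(slash_offsets)
    return [[k <= q and (k in cols or (q - k) in offs) for k in range(n)]
            for q in range(n)]

def block_sparse_mask(n, block, kept_blocks):
    """Assumed Block-Sparse: only the listed (query_block, key_block) tiles are kept."""
    kept = set(kept_blocks)
    return [[k <= q and (q // block, k // block) in kept for k in range(n)]
            for q in range(n)]

def density(mask):
    """Fraction of the causal (lower-triangular) entries that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)

n = 16
a = a_shape_mask(n, n_init=2, window=3)
vs = vertical_slash_mask(n, vertical_cols=[0, 5], slash_offsets=[0, 1])
bs = block_sparse_mask(n, block=4, kept_blocks=[(0, 0), (1, 1), (2, 2), (3, 3), (3, 0)])

for name, mask in [("A-shape", a), ("Vertical-Slash", vs), ("Block-Sparse", bs)]:
    print(f"{name}: keeps {density(mask):.2f} of causal entries")
```

In this toy setting the fraction of kept entries stands in for the work a sparse kernel would do relative to dense causal attention. It says nothing about real speedups, which depend on kernel efficiency and on how the indices are chosen.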
