MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

July 2, 2024
作者: Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
cs.AI

Abstract

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling for long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the patterns and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
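
The abstract outlines the core mechanism: each attention head is assigned one of three sparsity patterns offline (A-shape, Vertical-Slash, or Block-Sparse), a sparse index is built dynamically at inference time, and attention is evaluated only on the selected entries. The sketch below illustrates that idea in plain PyTorch with dense boolean masks for clarity; it is not the authors' optimized GPU kernels, and the helper names, thresholds, and the probe-based estimation heuristic are illustrative assumptions rather than the published implementation.

```python
# Minimal sketch of the idea behind MInference (not the authors' optimized GPU
# kernels): every head gets one of three sparsity patterns, a sparse mask is
# built dynamically at inference time, and attention is restricted to that mask.
# Helper names, thresholds, and the probe heuristic are illustrative assumptions.
import torch


def a_shape_mask(n, sink=64, window=512):
    """Attend to the first `sink` tokens plus a local causal window ("A-shape")."""
    i = torch.arange(n).unsqueeze(1)  # query positions
    j = torch.arange(n).unsqueeze(0)  # key positions
    return (j <= i) & ((j < sink) | (i - j < window))


def vertical_slash_mask(q, k, top_v=64, top_s=64, probe=64):
    """Estimate important columns ("vertical") and diagonals ("slash") from the
    attention of the last `probe` queries, then keep only those entries."""
    n, d = q.shape
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    est = (q[-probe:] @ k.T) / d ** 0.5                        # (probe, n) scores
    est = est.masked_fill(j > i[-probe:], float("-inf")).softmax(-1)
    vert = torch.topk(est.sum(0), min(top_v, n)).indices       # key columns to keep
    offs = (i[-probe:] - j).clamp(min=0)                       # diagonal offsets
    diag_score = torch.zeros(n).scatter_add_(0, offs.flatten(), est.flatten())
    slash = torch.topk(diag_score, min(top_s, n)).indices      # diagonals to keep
    # Always keep the main diagonal so every query attends to at least itself.
    return (j <= i) & ((i == j) | torch.isin(j, vert) | torch.isin(i - j, slash))


def block_sparse_mask(q, k, block=64, top_blocks=16):
    """Pool queries/keys into blocks, score block pairs, keep the top blocks per row."""
    n, d = q.shape
    nb = (n + block - 1) // block
    qb = torch.stack([q[s:s + block].mean(0) for s in range(0, n, block)])
    kb = torch.stack([k[s:s + block].mean(0) for s in range(0, n, block)])
    bi = torch.arange(nb).unsqueeze(1)
    bj = torch.arange(nb).unsqueeze(0)
    bscore = ((qb @ kb.T) / d ** 0.5).masked_fill(bj > bi, float("-inf"))
    keep = torch.zeros(nb, nb, dtype=torch.bool)
    for r in range(nb):                                        # top causal blocks per row
        keep[r, torch.topk(bscore[r], min(top_blocks, r + 1)).indices] = True
    mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)[:n, :n]
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return mask & (j <= i)


def sparse_attention(q, k, v, pattern):
    """Dense reference computation restricted to the dynamically built mask.
    A real kernel computes only the unmasked blocks instead of masking."""
    if pattern == "a_shape":
        mask = a_shape_mask(q.shape[0])
    elif pattern == "vertical_slash":
        mask = vertical_slash_mask(q, k)
    else:  # "block_sparse"
        mask = block_sparse_mask(q, k)
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    return scores.masked_fill(~mask, float("-inf")).softmax(-1) @ v


# Toy usage: in MInference the pattern for each head is chosen offline.
n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
for pattern in ("a_shape", "vertical_slash", "block_sparse"):
    print(pattern, sparse_attention(q, k, v, pattern).shape)
```

A production kernel would skip the masked-out regions entirely rather than materializing an n-by-n mask, which is where the reported pre-filling speedup comes from.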
