DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
October 14, 2024
Authors: Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
cs.AI
Abstract
Deploying long-context large language models (LLMs) is essential but poses
significant computational and memory challenges. Caching all Key and Value (KV)
states across all attention heads consumes substantial memory. Existing KV
cache pruning methods either damage the long-context capabilities of LLMs or
offer only limited efficiency improvements. In this paper, we identify that
only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for
processing long contexts and require full attention across all tokens. In
contrast, all other heads, which primarily focus on recent tokens and attention
sinks--referred to as Streaming Heads--do not require full attention. Based on
this insight, we introduce DuoAttention, a framework that only applies a full
KV cache to retrieval heads while using a lightweight, constant-length KV
cache for streaming heads, which reduces the LLM's decoding and pre-filling
memory and latency without compromising its long-context abilities.
DuoAttention uses a lightweight, optimization-based algorithm with synthetic
data to identify retrieval heads accurately. Our method significantly reduces
long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models
while speeding up decoding by up to 2.18x and 1.50x and accelerating
pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with
minimal accuracy loss compared to full attention. Notably, combined with
quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context
length on a single A100 GPU. Code is provided at
https://github.com/mit-han-lab/duo-attention.
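
To make the dual KV-cache idea concrete, below is a minimal sketch (not the authors' implementation) of how per-head caches could be pruned after pre-filling: retrieval heads keep every past token, while streaming heads keep only a few initial "attention sink" tokens plus a recent window, so their cache stays constant-length. The function name, tensor shapes, and the `is_retrieval_head` mask are illustrative assumptions; in DuoAttention that mask is produced by the optimization-based identification procedure described in the paper.

```python
# Minimal sketch of DuoAttention-style KV-cache pruning for one layer.
# Assumptions (not from the paper's code): cache layout [num_heads, seq_len, head_dim],
# and a boolean `is_retrieval_head` mask supplied by the head-identification step.
import torch


def prune_kv_cache(keys, values, is_retrieval_head, num_sink=4, recent_window=256):
    """Return per-head (k, v) pairs: full cache for retrieval heads,
    sink + recent-window cache for streaming heads."""
    num_heads, seq_len, _ = keys.shape
    pruned = []
    for h in range(num_heads):
        if is_retrieval_head[h] or seq_len <= num_sink + recent_window:
            # Retrieval heads (or short sequences): keep the full KV cache.
            pruned.append((keys[h], values[h]))
        else:
            # Streaming heads: keep only attention-sink tokens and the recent
            # window, so cache size is constant regardless of context length.
            k = torch.cat([keys[h, :num_sink], keys[h, -recent_window:]], dim=0)
            v = torch.cat([values[h, :num_sink], values[h, -recent_window:]], dim=0)
            pruned.append((k, v))
    return pruned


if __name__ == "__main__":
    heads, seq, dim = 8, 4096, 128
    k = torch.randn(heads, seq, dim)
    v = torch.randn(heads, seq, dim)
    # Hypothetical example: heads 0 and 3 were identified as retrieval heads.
    retrieval_mask = torch.zeros(heads, dtype=torch.bool)
    retrieval_mask[[0, 3]] = True
    cache = prune_kv_cache(k, v, retrieval_mask)
    print([tuple(kv[0].shape) for kv in cache])  # full vs. constant-length caches
```

In this sketch the memory saving comes entirely from streaming heads, whose caches shrink from `seq_len` to `num_sink + recent_window` entries; the overall reduction therefore depends on what fraction of heads are identified as retrieval heads.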