DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
October 14, 2024
Authors: Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
cs.AI
Abstract
Deploying long-context large language models (LLMs) is essential but poses
significant computational and memory challenges. Caching all Key and Value (KV)
states across all attention heads consumes substantial memory. Existing KV
cache pruning methods either damage the long-context capabilities of LLMs or
offer only limited efficiency improvements. In this paper, we identify that
only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for
processing long contexts and require full attention across all tokens. In
contrast, all other heads, which primarily focus on recent tokens and attention
sinks--referred to as Streaming Heads--do not require full attention. Based on
this insight, we introduce DuoAttention, a framework that only applies a full
KV cache to retrieval heads while using a lightweight, constant-length KV
cache for streaming heads, which reduces the LLM's decoding and pre-filling
memory and latency without compromising its long-context abilities.
DuoAttention uses a lightweight, optimization-based algorithm with synthetic
data to identify retrieval heads accurately. Our method significantly reduces
long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models
while speeding up decoding by up to 2.18x and 1.50x and accelerating
pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with
minimal accuracy loss compared to full attention. Notably, combined with
quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context
length on a single A100 GPU. Code is provided at
https://github.com/mit-han-lab/duo-attention.
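
To make the dual KV-cache idea concrete, below is a minimal sketch (not the authors' implementation) of how per-head caches could be pruned after pre-filling: retrieval heads keep every past token, while streaming heads keep only a few initial "attention sink" tokens plus a recent window, so their cache stays constant-length. The function name, tensor shapes, and the `is_retrieval_head` mask are illustrative assumptions; in DuoAttention that mask is produced by the optimization-based identification procedure described in the paper.

```python
# Minimal sketch of DuoAttention-style KV-cache pruning for one layer.
# Assumptions (not from the paper's code): cache layout [num_heads, seq_len, head_dim],
# and a boolean `is_retrieval_head` mask supplied by the head-identification step.
import torch


def prune_kv_cache(keys, values, is_retrieval_head, num_sink=4, recent_window=256):
    """Return per-head (k, v) pairs: full cache for retrieval heads,
    sink + recent-window cache for streaming heads."""
    num_heads, seq_len, _ = keys.shape
    pruned = []
    for h in range(num_heads):
        if is_retrieval_head[h] or seq_len <= num_sink + recent_window:
            # Retrieval heads (or short sequences): keep the full KV cache.
            pruned.append((keys[h], values[h]))
        else:
            # Streaming heads: keep only attention-sink tokens and the recent
            # window, so cache size is constant regardless of context length.
            k = torch.cat([keys[h, :num_sink], keys[h, -recent_window:]], dim=0)
            v = torch.cat([values[h, :num_sink], values[h, -recent_window:]], dim=0)
            pruned.append((k, v))
    return pruned


if __name__ == "__main__":
    heads, seq, dim = 8, 4096, 128
    k = torch.randn(heads, seq, dim)
    v = torch.randn(heads, seq, dim)
    # Hypothetical example: heads 0 and 3 were identified as retrieval heads.
    retrieval_mask = torch.zeros(heads, dtype=torch.bool)
    retrieval_mask[[0, 3]] = True
    cache = prune_kv_cache(k, v, retrieval_mask)
    print([tuple(kv[0].shape) for kv in cache])  # full vs. constant-length caches
```

In this sketch the memory saving comes entirely from streaming heads, whose caches shrink from `seq_len` to `num_sink + recent_window` entries; the overall reduction therefore depends on what fraction of heads are identified as retrieval heads.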