

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

October 14, 2024
Authors: Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han
cs.AI

Abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a. Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks (referred to as Streaming Heads), do not require full attention. Based on this insight, we introduce DuoAttention, a framework that applies a full KV cache only to retrieval heads while using a lightweight, constant-length KV cache for streaming heads, which reduces both the LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models, while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with a 3.3 million token context length on a single A100 GPU. Code is available at https://github.com/mit-han-lab/duo-attention.
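To make the dual-cache idea concrete, below is a minimal sketch (not the authors' implementation) of the per-head KV-cache policy the abstract describes: retrieval heads keep the full cache, while streaming heads keep only a few attention-sink tokens plus a recent window. The names `prune_kv_cache`, `is_retrieval_head`, `sink_size`, and `recent_size` are illustrative assumptions; in DuoAttention, the retrieval-head labels come from the paper's optimization-based identification on synthetic data.

```python
# Minimal sketch of a per-head KV-cache policy in the spirit of DuoAttention.
# All parameter names and sizes here are illustrative, not the official API.
import torch

def prune_kv_cache(keys, values, is_retrieval_head, sink_size=4, recent_size=256):
    """Prune cached KV states head by head.

    keys, values: [num_heads, seq_len, head_dim]
    is_retrieval_head: bool tensor [num_heads]
    Retrieval heads keep every cached token; streaming heads keep only the
    first `sink_size` (attention-sink) tokens plus the last `recent_size` tokens.
    """
    num_heads, seq_len, _ = keys.shape
    pruned_k, pruned_v = [], []
    for h in range(num_heads):
        if is_retrieval_head[h] or seq_len <= sink_size + recent_size:
            pruned_k.append(keys[h])    # full cache: memory grows with context
            pruned_v.append(values[h])
        else:
            k = torch.cat([keys[h, :sink_size], keys[h, -recent_size:]], dim=0)
            v = torch.cat([values[h, :sink_size], values[h, -recent_size:]], dim=0)
            pruned_k.append(k)          # constant-length cache for streaming heads
            pruned_v.append(v)
    return pruned_k, pruned_v           # ragged per-head caches

# Toy usage: 8 heads, 1024 cached tokens, head_dim 64; heads 0 and 3 act as retrieval heads.
k = torch.randn(8, 1024, 64)
v = torch.randn(8, 1024, 64)
retrieval_mask = torch.zeros(8, dtype=torch.bool)
retrieval_mask[[0, 3]] = True
pk, pv = prune_kv_cache(k, v, retrieval_mask)
print([t.shape[0] for t in pk])  # [1024, 260, 260, 1024, 260, 260, 260, 260]
```

Because only the retrieval heads' caches grow with context length, overall KV memory scales with the fraction of retrieval heads rather than with the total head count, which is the source of the memory and latency savings reported above.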
