TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
October 7, 2024
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
cs.AI
Abstract
Large language models (LLMs) have driven significant advancements across
diverse NLP tasks, with long-context models gaining prominence for handling
extended inputs. However, the expanding key-value (KV) cache size required by
Transformer architectures intensifies the memory constraints, particularly
during the decoding phase, creating a significant bottleneck. Existing sparse
attention mechanisms designed to address this bottleneck have two limitations:
(1) they often fail to reliably identify the most relevant tokens for
attention, and (2) they overlook the spatial coherence of token selection
across consecutive Transformer layers, which can lead to performance
degradation and substantial overhead in token selection. This paper introduces
TidalDecode, a simple yet effective algorithm and system for fast and accurate
LLM decoding through position persistent sparse attention. TidalDecode
leverages the spatial coherence of tokens selected by existing sparse attention
methods and introduces a few token selection layers that perform full attention
to identify the tokens with the highest attention scores, while all other
layers perform sparse attention with the pre-selected tokens. This design
enables TidalDecode to substantially reduce the overhead of token selection for
sparse attention without sacrificing the quality of the generated results.
Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely
matches the generative performance of full attention methods while reducing the
LLM decoding latency by up to 2.1x.
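To make the design described above concrete, here is a minimal sketch of how position persistent sparse attention could be organized at decode time: a small number of token selection layers run full attention over the KV cache and record the highest-scoring token positions, while the remaining layers attend only to those pre-selected positions. This is not the paper's implementation; the selection-layer indices, token budget, tensor shapes, and function names are illustrative assumptions.

```python
# Minimal sketch of position persistent sparse attention during decoding.
# Assumptions: per-layer queries are shown as independent inputs (in a real
# model each layer's query comes from the previous layer's output), and the
# selection-layer indices and token budget are placeholders.
import torch


def selection_layer(q, k_cache, v_cache, budget):
    """Full attention over the whole KV cache; also returns the positions of
    the highest-scoring tokens so later layers can reuse them."""
    # q: [heads, 1, dim]; k_cache, v_cache: [heads, seq, dim]
    scale = k_cache.shape[-1] ** 0.5
    probs = torch.softmax(q @ k_cache.transpose(-1, -2) / scale, dim=-1)
    out = probs @ v_cache
    # Aggregate attention mass across heads and keep the top-`budget` positions.
    scores_per_token = probs.sum(dim=0).squeeze(0)  # [seq]
    positions = torch.topk(scores_per_token,
                           k=min(budget, k_cache.shape[1])).indices
    return out, positions


def persistent_sparse_layer(q, k_cache, v_cache, positions):
    """Sparse attention restricted to the token positions selected earlier."""
    k_sel = k_cache[:, positions, :]
    v_sel = v_cache[:, positions, :]
    scale = k_sel.shape[-1] ** 0.5
    probs = torch.softmax(q @ k_sel.transpose(-1, -2) / scale, dim=-1)
    return probs @ v_sel


def decode_step(queries, k_caches, v_caches, selection_layers=(0, 13), budget=256):
    """One decoding step: a few layers re-select tokens with full attention,
    and every other layer reuses the most recent selection."""
    positions, outputs = None, []
    for layer, (q, k, v) in enumerate(zip(queries, k_caches, v_caches)):
        if positions is None or layer in selection_layers:
            out, positions = selection_layer(q, k, v, budget)
        else:
            out = persistent_sparse_layer(q, k, v, positions)
        outputs.append(out)
    return outputs
```

Because the non-selection layers skip both score-based token search and attention over the full cache, the per-step cost of token selection is paid only at the few selection layers, which is the source of the reported latency reduction.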