

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

October 7, 2024
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
cs.AI

Abstract

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
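
To make the mechanism concrete, below is a minimal, hypothetical sketch of position persistent sparse attention during one decoding step. It is not the authors' released implementation: the function name `decode_step`, the choice of selection layers, the cache layout, and the token budget `k` are all illustrative assumptions.

```python
import torch

def decode_step(q_per_layer, kv_cache, selection_layers=(0, 13), k=256):
    """One decoding step across all Transformer layers (illustrative sketch).

    q_per_layer:      list of per-layer query vectors, each [heads, dim]
    kv_cache:         list of per-layer (K, V) tensors, each [seq, heads, dim]
    selection_layers: layers that run full attention and refresh the shared
                      set of selected token positions (assumed indices)
    k:                number of tokens kept for sparse attention (assumed)
    """
    selected = None  # token positions reused by subsequent sparse layers
    outputs = []
    for layer, (q, (K, V)) in enumerate(zip(q_per_layer, kv_cache)):
        # Attention scores over the full KV cache: [heads, seq]
        scores = torch.einsum("hd,shd->hs", q, K) / K.shape[-1] ** 0.5
        if layer in selection_layers or selected is None:
            # Token selection layer: full attention over every cached token,
            # then keep the positions with the highest aggregate scores.
            probs = torch.softmax(scores, dim=-1)
            out = torch.einsum("hs,shd->hd", probs, V)
            selected = probs.sum(dim=0).topk(min(k, K.shape[0])).indices
        else:
            # Sparse layer: attend only to the positions pre-selected by the
            # most recent token selection layer, skipping re-selection.
            sub = torch.softmax(scores[:, selected], dim=-1)
            out = torch.einsum("hs,shd->hd", sub, V[selected])
        outputs.append(out)
    return outputs
```

In this design, most layers take the sparse branch, so the full pass over the entire KV cache runs only in the few selection layers; reusing the selected positions across the intervening layers is what cuts the token-selection overhead the abstract describes.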

