
TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

October 7, 2024
Authors: Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia
cs.AI

Abstract

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer architectures intensifies the memory constraints, particularly during the decoding phase, creating a significant bottleneck. Existing sparse attention mechanisms designed to address this bottleneck have two limitations: (1) they often fail to reliably identify the most relevant tokens for attention, and (2) they overlook the spatial coherence of token selection across consecutive Transformer layers, which can lead to performance degradation and substantial overhead in token selection. This paper introduces TidalDecode, a simple yet effective algorithm and system for fast and accurate LLM decoding through position persistent sparse attention. TidalDecode leverages the spatial coherence of tokens selected by existing sparse attention methods and introduces a few token selection layers that perform full attention to identify the tokens with the highest attention scores, while all other layers perform sparse attention with the pre-selected tokens. This design enables TidalDecode to substantially reduce the overhead of token selection for sparse attention without sacrificing the quality of the generated results. Evaluation on a diverse set of LLMs and tasks shows that TidalDecode closely matches the generative performance of full attention methods while reducing the LLM decoding latency by up to 2.1x.
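The mechanism described in the abstract can be made concrete with a short sketch. Below is a minimal, single-head NumPy illustration of position persistent sparse attention during one decoding step. It is a sketch under stated assumptions, not the paper's implementation: the names decode_step and attend, the selection_layers set, and the fixed token budget k are hypothetical, and the actual system operates on batched, multi-head GPU tensors with optimized kernels.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Single-head attention for one query: returns output and scores."""
    scores = softmax(K @ q / np.sqrt(q.shape[-1]))
    return scores @ V, scores

def decode_step(q_per_layer, kv_cache, selection_layers, k=64):
    """One decoding step with position persistent sparse attention (sketch).

    q_per_layer      -- per-layer query vectors for the newly decoded token
    kv_cache         -- list of (K, V) arrays of shape (n_tokens, d), one per layer
    selection_layers -- indices of the few layers that run full attention
                        and re-select the top-k token positions
    """
    selected = None   # token positions persisted across subsequent layers
    outputs = []
    for layer, (q, (K, V)) in enumerate(zip(q_per_layer, kv_cache)):
        if layer in selection_layers or selected is None:
            # Token selection layer: full attention over the entire KV
            # cache, then persist the k highest-scoring positions.
            out, scores = attend(q, K, V)
            selected = np.argsort(scores)[-k:]
        else:
            # Sparse layer: attend only to the pre-selected positions,
            # skipping any per-layer top-k search.
            out, _ = attend(q, K[selected], V[selected])
        outputs.append(out)
    return outputs, selected
```

Because the sparse layers reuse the positions persisted by the most recent selection layer, the top-k search that dominates the token-selection overhead of prior per-layer sparse attention methods runs only a few times per decoding step.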
