

SubGen: Token Generation in Sublinear Time and Memory

February 8, 2024
Authors: Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
cs.AI

Abstract

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this key insight, we have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online ℓ₂ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
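
The abstract names two ingredients: online clustering of key embeddings (so the softmax normalization can be approximated from a small set of cluster centers) and online ℓ₂ sampling of value embeddings (so the attention numerator can be estimated from a small weighted sample). Below is a minimal Python sketch of those two ideas, not the paper's SubGen implementation: the class name ClusteredKVCache, the radius merge threshold, the per-slot reservoir construction, and the estimators in attend are all illustrative assumptions.

```python
import numpy as np

class ClusteredKVCache:
    """Illustrative compressed KV cache (not the SubGen algorithm itself):
    keys are summarized by greedy online clustering, values by sampling
    proportional to their squared l2 norm."""

    def __init__(self, radius=1.0, n_samples=64, seed=0):
        self.radius = radius                 # hypothetical merge threshold for keys
        self.n_samples = n_samples           # number of value-sample slots
        self.rng = np.random.default_rng(seed)
        self.centers = []                    # running means of clustered keys
        self.counts = []                     # number of keys absorbed per center
        self.samples = [None] * n_samples    # (key, value, weight) triples
        self.total_weight = 0.0              # running sum of ||v||^2

    def insert(self, key, value):
        # Online clustering: fold the key into the nearest center if it lies
        # within `radius`, otherwise open a new cluster for it.
        merged = False
        if self.centers:
            dists = [np.linalg.norm(key - c) for c in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:
                self.counts[j] += 1
                self.centers[j] += (key - self.centers[j]) / self.counts[j]
                merged = True
        if not merged:
            self.centers.append(np.array(key, dtype=float))
            self.counts.append(1)

        # Online l2 sampling: each slot independently keeps the incoming token
        # with probability ||v||^2 / (running total of ||v||^2); telescoping
        # shows each slot then holds token i with probability ||v_i||^2 / total.
        w = float(value @ value)
        self.total_weight += w
        for s in range(self.n_samples):
            if self.samples[s] is None or self.rng.random() < w / self.total_weight:
                self.samples[s] = (np.array(key), np.array(value), w)

    def attend(self, query):
        # Softmax denominator sum_i exp(q.k_i), approximated from clusters:
        # each center stands in for all `count` keys merged into it.
        denom = sum(c * np.exp(query @ mu)
                    for mu, c in zip(self.centers, self.counts))
        # Softmax numerator sum_i exp(q.k_i) v_i, estimated by importance
        # sampling: a slot holding (k, v, w) contributes exp(q.k) * v * W / w,
        # which is unbiased because the slot picked it with probability w / W.
        live = [t for t in self.samples if t is not None]
        num = np.zeros_like(query, dtype=float)
        for k, v, w in live:
            num += np.exp(query @ k) * v * (self.total_weight / w)
        return num / (len(live) * denom)
```

In this sketch, memory grows with the number of clusters and the fixed reservoir size rather than with the sequence length, which is the mechanism behind the sublinear footprint the abstract claims; the paper's actual estimators and their error guarantees differ in detail.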