SubGen: Token Generation in Sublinear Time and Memory
February 8, 2024
Authors: Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
cs.AI
Abstract
Despite the significant success of large language models (LLMs), their
extensive memory requirements pose challenges for deploying them in
long-context token generation. The substantial memory footprint of LLM decoders
arises from the necessity to store all previous tokens in the attention module,
a requirement imposed by key-value (KV) caching. In this work, our focus is on
developing an efficient compression technique for the KV cache. Empirical
evidence indicates a significant clustering tendency within key embeddings in
the attention module. Building on this key insight, we have devised a novel
caching method with sublinear complexity, employing online clustering on key
tokens and online ℓ₂ sampling on values. The result is a provably
accurate and efficient attention decoding algorithm, termed SubGen. Not only
does this algorithm ensure a sublinear memory footprint and sublinear time
complexity, but we also establish a tight error bound for our approach.
Empirical evaluations on long-context question-answering tasks demonstrate that
SubGen significantly outperforms existing state-of-the-art KV cache
compression methods in both performance and efficiency.
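
For intuition, below is a minimal, self-contained sketch of the two ingredients the abstract describes: online clustering of key embeddings (used here to estimate the softmax normalizer) and ℓ₂-proportional sampling of values (used to estimate the attention output). This is an illustrative toy under stated assumptions, not the paper's exact algorithm: the names `SubGenToyCache`, `radius`, and `budget` are hypothetical, the sampling uses the standard A-Res weighted-reservoir scheme, and the estimator treats the reservoir as i.i.d. ℓ₂-proportional draws, which is only an approximation.

```python
import heapq

import numpy as np


def exact_attention(q, K, V):
    """Exact single-query attention, softmax(K @ q) @ V, for comparison."""
    s = K @ q
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V


class SubGenToyCache:
    """Toy compressed KV cache in the spirit of SubGen: keys are greedily
    clustered online; (key, value) pairs are kept by l2-weighted sampling."""

    def __init__(self, radius=1.0, budget=64, seed=0):
        self.radius = radius            # hypothetical clustering radius
        self.budget = budget            # hypothetical sample budget
        self.rng = np.random.default_rng(seed)
        self.centers, self.counts = [], []
        self.heap = []                  # min-heap of (priority, id, k, v, w)
        self.total_l2 = 0.0
        self._id = 0

    def insert(self, k, v):
        # Online greedy key clustering: absorb the token into the nearest
        # existing center if it lies within `radius`, else open a new cluster.
        merged = False
        if self.centers:
            d = np.linalg.norm(np.asarray(self.centers) - k, axis=1)
            j = int(d.argmin())
            if d[j] <= self.radius:
                self.counts[j] += 1
                merged = True
        if not merged:
            self.centers.append(k)
            self.counts.append(1)
        # l2-weighted reservoir sampling (A-Res): keep the `budget` pairs
        # with the largest u ** (1 / w), where w = ||v||_2.
        w = float(np.linalg.norm(v)) + 1e-12
        self.total_l2 += w
        prio = self.rng.random() ** (1.0 / w)
        self._id += 1
        item = (prio, self._id, k, v, w)
        if len(self.heap) < self.budget:
            heapq.heappush(self.heap, item)
        elif prio > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)

    def attend(self, q):
        """Approximate softmax attention for query q from the compressed cache."""
        s = np.asarray(self.centers) @ q
        smax = s.max()
        # Softmax denominator estimated from the key clusters:
        # sum_i exp(q . k_i)  ~=  sum_j count_j * exp(q . c_j).
        z = float(np.sum(np.asarray(self.counts) * np.exp(s - smax)))
        # Numerator estimated from the sampled pairs, importance-weighted by
        # the l2 sampling probabilities w_i / total_l2 (an i.i.d. idealization).
        out = 0.0
        for _, _, k, v, w in self.heap:
            p = np.exp(float(k @ q) - smax) / z   # approximate softmax weight
            out = out + (p * self.total_l2 / w) * v
        return out / len(self.heap)


# Usage: compare the compressed-cache estimate against exact attention.
rng = np.random.default_rng(1)
d, n = 16, 4000
K = rng.normal(size=(n, d)) / d ** 0.5
V = rng.normal(size=(n, d))
cache = SubGenToyCache(radius=1.5, budget=512, seed=1)
for k, v in zip(K, V):
    cache.insert(k, v)
q = rng.normal(size=d) / d ** 0.5
approx, exact = cache.attend(q), exact_attention(q, K, V)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The sketch mirrors the claimed complexity profile: per-query cost is O(#clusters + budget) rather than O(n), and memory stays sublinear whenever the key embeddings cluster well, which is exactly the empirical tendency the abstract builds on.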