SubGen: Token Generation in Sublinear Time and Memory
February 8, 2024
Authors: Amir Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi
cs.AI
Abstract
Despite the significant success of large language models (LLMs), their
extensive memory requirements pose challenges for deploying them in
long-context token generation. The substantial memory footprint of LLM decoders
arises from the necessity to store all previous tokens in the attention module,
a requirement imposed by key-value (KV) caching. In this work, our focus is on
developing an efficient compression technique for the KV cache. Empirical
evidence indicates a significant clustering tendency within key embeddings in
the attention module. Building on this key insight, we have devised a novel
caching method with sublinear complexity, employing online clustering on key
tokens and online $\ell_2$ sampling on values. The result is a provably
accurate and efficient attention decoding algorithm, termed SubGen. Not only
does this algorithm ensure a sublinear memory footprint and sublinear time
complexity, but we also establish a tight error bound for our approach.
Empirical evaluations on long-context question-answering tasks demonstrate that
SubGen significantly outperforms existing state-of-the-art KV cache
compression methods in terms of performance and efficiency.
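
As an illustration of the two streaming primitives the abstract names, the sketch below pairs a greedy online clustering of key embeddings with a one-pass weighted reservoir sampler over value vectors (sampling with probability proportional to the squared $\ell_2$ norm, one common reading of "$\ell_2$ sampling"). This is a minimal toy under those assumptions, not the SubGen algorithm from the paper; the class and parameter names (`OnlineKeyClustering`, `OnlineL2Sampler`, `radius`) are hypothetical, and the paper's actual data structure, error bounds, and decoding step are not reproduced here.

```python
import numpy as np

class OnlineKeyClustering:
    """Greedy online clustering of streamed key embeddings (illustrative).

    A new key is absorbed by the nearest existing center if it lies
    within `radius`; otherwise it opens a new cluster. When keys are
    well clustered, the number of stored centers stays far below the
    stream length, which is the flavor of sublinear memory at play.
    """

    def __init__(self, radius: float):
        self.radius = radius
        self.centers: list[np.ndarray] = []
        self.counts: list[int] = []

    def insert(self, key: np.ndarray) -> None:
        if self.centers:
            dists = [float(np.linalg.norm(key - c)) for c in self.centers]
            j = int(np.argmin(dists))
            if dists[j] <= self.radius:
                self.counts[j] += 1  # absorb into the nearest cluster
                return
        self.centers.append(key)     # open a new cluster
        self.counts.append(1)

class OnlineL2Sampler:
    """One-pass weighted reservoir sampler over value vectors.

    Keeps a single (key, value) pair; after n insertions, pair i is
    retained with probability ||v_i||_2^2 / sum_j ||v_j||_2^2. (Sampling
    proportional to the squared norm is an assumption here; a full
    method would keep many independent samples, not one.)
    """

    def __init__(self, rng: np.random.Generator):
        self.rng = rng
        self.total = 0.0    # running sum of squared value norms
        self.sample = None  # (key, value) pair currently held

    def insert(self, key: np.ndarray, value: np.ndarray) -> None:
        w = float(np.dot(value, value))
        self.total += w
        # Replace the held sample with probability w / total: a standard
        # weighted reservoir step that yields the target distribution.
        if self.total > 0 and self.rng.random() < w / self.total:
            self.sample = (key, value)

# Toy usage: 1,000 keys drawn near 8 hidden centers, so they cluster well.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 64))
clusters = OnlineKeyClustering(radius=3.0)
sampler = OnlineL2Sampler(rng)
for _ in range(1000):
    k = hidden[rng.integers(8)] + 0.1 * rng.standard_normal(64)
    v = rng.standard_normal(64)
    clusters.insert(k)
    sampler.insert(k, v)
print(len(clusters.centers), "centers stored for 1000 streamed keys")
```

Under this clustering assumption the cache stores one center and a count per cluster instead of every key, and the sampler holds O(1) state per retained sample, so memory scales with the number of clusters and samples rather than with the context length.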