From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
February 26, 2025
Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
cs.AI
Abstract
Generating ultra-long sequences with large language models (LLMs) has become
increasingly crucial but remains a highly time-intensive task, particularly for
sequences up to 100K tokens. While traditional speculative decoding methods
exist, simply extending their generation limits fails to accelerate the process
and can be detrimental. Through an in-depth analysis, we identify three major
challenges hindering efficient generation: frequent model reloading, dynamic
key-value (KV) management and repetitive generation. To address these issues,
we introduce TOKENSWIFT, a novel framework designed to substantially accelerate
the generation process of ultra-long sequences while maintaining the target
model's inherent quality. Experimental results demonstrate that TOKENSWIFT
achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B,
14B) and architectures (MHA, GQA). This acceleration translates to hours of
time savings for ultra-long sequence generation, establishing TOKENSWIFT as a
scalable and effective solution at unprecedented lengths. Code can be found at
https://github.com/bigai-nlco/TokenSwift.
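For readers unfamiliar with the speculative decoding the abstract builds on, the following is a minimal toy sketch of the standard draft-then-verify loop: a cheap draft model proposes a few tokens, the target model verifies them, and the longest agreeing prefix is kept plus one correction. The two "models" here are toy next-token rules invented for illustration; this is not the TOKENSWIFT implementation.

```python
# Toy sketch of greedy speculative decoding. Both "models" are plain
# functions mapping a context (list of token IDs) to the next token ID;
# they stand in for a small draft LM and a large target LM.

def draft_model(context):
    # Cheap approximation: next token is simply last token + 1.
    return (context[-1] + 1) % 50

def target_model(context):
    # "Ground truth": same rule, except it skips ahead by 2 after any
    # multiple of 7, so the draft occasionally disagrees with it.
    if context[-1] % 7 == 0:
        return (context[-1] + 2) % 50
    return (context[-1] + 1) % 50

def speculative_step(context, k=4):
    """Draft k tokens, keep the longest prefix the target agrees with,
    and append one token from the target (the standard accept rule)."""
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in draft:
        expect = target_model(ctx)
        if expect == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expect)  # target's correction; stop here
            return accepted
    accepted.append(target_model(ctx))  # all drafts accepted: bonus token
    return accepted

seq = [1]
while len(seq) < 12:
    seq.extend(speculative_step(seq))
print(seq[:12])
```

Each step emits several tokens for roughly one target-model verification pass, which is the source of the speedup; the abstract's point is that naively stretching this loop to 100K tokens breaks down due to model reloading, KV-cache management, and repetition.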