From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Abstract

Generating ultra-long sequences with large language models (LLMs) has become increasingly important but remains highly time-intensive, particularly for sequences of up to 100K tokens. Traditional speculative decoding methods exist, but simply extending their generation limits fails to accelerate the process and can even be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management, and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves a speedup of over 3× across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code is available at https://github.com/bigai-nlco/TokenSwift.
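To make the "lossless" property concrete, the following is a minimal sketch of generic greedy draft-and-verify speculative decoding. It is not the TOKENSWIFT algorithm, whose details the abstract does not give. The names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are toy deterministic functions over integer tokens, chosen only so the example runs on its own. The point it illustrates is that a draft token is kept only when the target model would have produced the same token, so the output matches plain greedy decoding with the target model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: generate n_new tokens one at a time with the target model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy draft-and-verify: the draft proposes up to k tokens, the target checks them."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target verifies the proposal position by position.
        #    (A real system would score all positions in one batched forward
        #    pass; this sketch calls the target sequentially for clarity.)
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)  # always keep the target's own token
            if expected != tok:   # first mismatch: discard the rest of the draft
                break
    return seq


# Toy stand-ins (assumptions for illustration only, not real LLMs).
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + len(seq)) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every third position.
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 7


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, n_new=20)
    slow = greedy_decode(toy_target, prompt, n_new=20)
    print(fast == slow)
```

In this sketch any speedup would come only from the batched verification step, which is simulated here. The challenges named in the abstract (model reloading, KV management, repetitive generation) are not modeled.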
