From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Abstract

Generating ultra-long sequences with large language models (LLMs) has become increasingly important, yet it remains highly time-intensive, particularly for sequences of up to 100K tokens. Traditional speculative decoding methods exist, but simply extending their generation limits fails to accelerate the process and can even be detrimental. Through an in-depth analysis, we identify three major challenges that hinder efficient generation: frequent model reloading, dynamic key-value (KV) management, and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation of ultra-long sequences while preserving the target model's inherent quality. Experimental results show that TOKENSWIFT achieves a speedup of more than 3x across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time saved in ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.

