LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding

Abstract

Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1 to 3 tokens per forward pass (TPF). In this work, we show that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). We then introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm that identifies a superior TFO and thereby accelerates inference. LoPA concurrently explores distinct candidate TFOs via parallel branches and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial improvement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to support this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.
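The branch-selection idea described in the abstract can be illustrated with a toy sketch. This is not the released implementation: the abstract does not say how candidate branches are built or how branch confidence is computed, so the threshold of 0.9, the mean-confidence score, the hard-coded lookahead confidences, and the function names `fill_by_threshold`, `branch_confidence`, and `select_branch` are all assumptions made for illustration.

```python
from statistics import mean


def fill_by_threshold(conf, masked, threshold=0.9):
    """Confidence-driven step: choose masked positions whose confidence
    clears the threshold (assumed value). Falls back to the single most
    confident position so that decoding always makes progress."""
    picked = [i for i in masked if conf[i] >= threshold]
    if not picked:
        picked = [max(masked, key=lambda i: conf[i])]
    return picked


def branch_confidence(next_conf, still_masked):
    """Assumed branch score: mean confidence over positions that remain
    masked after the branch's fill. Higher suggests more tokens could be
    filled in parallel on the next step."""
    if not still_masked:
        return 1.0
    return mean(next_conf[i] for i in still_masked)


def select_branch(branches):
    """Each branch is (filled_positions, next_conf, still_masked).
    Return the branch with the highest assumed branch confidence."""
    return max(branches, key=lambda b: branch_confidence(b[1], b[2]))


# Toy state: four masked positions with made-up confidences.
conf = [0.95, 0.60, 0.92, 0.40]
masked = [0, 1, 2, 3]
anchor = fill_by_threshold(conf, masked)  # positions filled by plain thresholding

# Two candidate filling orders: each fills the anchor positions plus one
# extra position. The next-step confidences are invented; in a real system
# they would come from one forward pass per branch, run in parallel.
branches = [
    (anchor + [1], [1.0, 1.0, 1.0, 0.55], [3]),
    (anchor + [3], [1.0, 0.97, 1.0, 1.0], [1]),
]
best = select_branch(branches)
print("anchor fill:", anchor)
print("selected fill:", best[0])
```

In this toy setup, the branch that leaves the remaining masked position with higher confidence is preferred, which mirrors the abstract's notion of choosing the order with the most potential for future parallelism.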
