LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
December 18, 2025
作者: Chenkai Xu, Yijie Jin, Jiajun Li, Yi Tu, Guoping Long, Dandan Tu, Mingcong Song, Hongjie Si, Tianqi Hou, Junchi Yan, Zhijie Deng
cs.AI
Abstract
Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1-3 tokens per forward pass (TPF). In this work, we identify, for the first time, that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Building on this observation, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm that identifies a superior TFO and thereby accelerates inference. LoPA concurrently explores distinct candidate TFOs via parallel branches and selects the one with the highest potential for future parallelism based on branch confidence. Applying LoPA to the state-of-the-art D2F model yields a substantial enhancement in decoding efficiency: notably, LoPA raises the TPF of D2F-Dream to 10.1 on GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to support this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.
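
To make the branch-and-select mechanism concrete, below is a minimal Python sketch of one LoPA-style decoding step, assuming a block-wise masked-diffusion decoder that returns per-position token predictions and confidences in a single forward pass. This is not the released implementation: the names `dllm_forward`, `MASK`, `TAU`, and `NUM_BRANCHES` are hypothetical, the model call is stubbed with random values so the snippet runs standalone, and the confidence-sum branch score is one plausible reading of the paper's selection criterion rather than its exact formula.

```python
# Illustrative sketch of LoPA-style lookahead parallel decoding (not the
# authors' code). The dLLM forward pass is stubbed with random values so
# the example is self-contained and runnable.
import torch

MASK = -1           # sentinel id for still-masked positions (hypothetical)
TAU = 0.9           # confidence threshold for committing tokens
NUM_BRANCHES = 4    # candidate Token Filling Orders (TFOs) explored per step
VOCAB = 32000

def dllm_forward(seq: torch.Tensor):
    """Stub for one dLLM forward pass: per-position token predictions and
    confidences. A real decoder would run the diffusion model here."""
    return torch.randint(0, VOCAB, seq.shape), torch.rand(seq.shape)

def lopa_step(seq: torch.Tensor) -> torch.Tensor:
    masked = seq == MASK
    if not masked.any():
        return seq
    tokens, conf = dllm_forward(seq)

    # 1) Branching: commit a different high-confidence position in each
    #    branch, so each branch embodies a distinct candidate TFO.
    k = min(NUM_BRANCHES, int(masked.sum()))
    cand = torch.topk(conf.masked_fill(~masked, -1.0), k).indices
    branches = torch.stack([seq.clone() for _ in range(k)])
    branches[torch.arange(k), cand] = tokens[cand]

    # 2) Lookahead: one batched forward over all branches; score each branch
    #    by the total confidence it induces on its remaining masked
    #    positions, i.e., its potential for future parallelism.
    _, branch_conf = dllm_forward(branches)
    scores = (branch_conf * (branches == MASK)).sum(dim=-1)
    best = branches[int(scores.argmax())]

    # 3) Parallel commit: fill every position whose confidence clears TAU,
    #    so a single step can decode many tokens at once (high TPF).
    best_tokens, best_conf = dllm_forward(best)
    commit = (best == MASK) & (best_conf > TAU)
    best[commit] = best_tokens[commit]
    return best

seq = torch.full((64,), MASK)      # fully masked block
while (seq == MASK).any():         # iterate until every token is filled
    seq = lopa_step(seq)
```

The batched forward in step 2 is also where Branch Parallelism fits naturally: the k branch sequences are independent, so instead of stacking them on one device they can be sharded across GPUs, which is how a BP-style system could sustain high single-sample throughput.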