DMax: Aggressive Parallel Decoding for dLLMs

April 9, 2026
作者: Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
cs.AI

Abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs, which decode through a binary mask-to-token transition, DMax reformulates decoding as progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens both from masked inputs and from its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding: each intermediate decoding state is represented as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves tokens per forward pass (TPF) on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 tokens per second (TPS) at batch size 1. Code is available at: https://github.com/czg1225/DMax
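To make the Soft Parallel Decoding idea concrete, the sketch below shows one way an intermediate decoding state could be formed as an interpolation between a predicted token embedding and the mask embedding. This is a minimal illustration under assumed conventions, not the paper's actual implementation: the function name `soft_decode_state`, the use of a probability-weighted expectation over the embedding table, and the scalar weight `alpha` are all hypothetical.

```python
import numpy as np

def soft_decode_state(pred_probs, token_embeddings, mask_embedding, alpha):
    """Hypothetical sketch of a soft intermediate decoding state.

    pred_probs:       (vocab,) predicted token distribution for one position
    token_embeddings: (vocab, dim) embedding table
    mask_embedding:   (dim,) embedding of the [MASK] token
    alpha:            weight in [0, 1] toward the prediction; alpha = 0
                      reproduces the fully masked state, alpha = 1 commits
                      fully to the predicted token embedding.
    """
    # Expected token embedding under the model's current prediction.
    pred_embedding = pred_probs @ token_embeddings
    # Interpolate between prediction and mask in embedding space.
    return alpha * pred_embedding + (1.0 - alpha) * mask_embedding
```

In an iterative scheme of this kind, `alpha` would presumably grow across refinement steps (e.g. with prediction confidence), letting the model revise low-confidence positions while they are still partly "masked" rather than committing to a hard token at once.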