DMax: Aggressive Parallel Decoding for Diffusion Language Models

Abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). DMax mitigates error accumulation in parallel decoding, which allows more aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs, which decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs and equips the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding, which represents each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves tokens per forward pass (TPF) on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 tokens per second (TPS) at batch size 1. Code is available at: https://github.com/czg1225/DMax
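
The interpolation behind Soft Parallel Decoding can be illustrated with a minimal sketch. The abstract does not say how the interpolation weight is chosen, so the scalar `weight` below (for example, a per-position confidence) is an assumption, and the function name `soft_state` and the toy embedding values are hypothetical, not taken from the DMax code base.

```python
from typing import List

Vector = List[float]


def soft_state(token_emb: Vector, mask_emb: Vector, weight: float) -> Vector:
    """Interpolate between a predicted token embedding and the mask embedding.

    Assumption: `weight` is a scalar in [0, 1]. How DMax actually sets it
    is not specified in the abstract.
    weight = 0.0 returns the mask embedding (position still fully masked);
    weight = 1.0 returns the token embedding (prediction fully committed).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [weight * t + (1.0 - weight) * m for t, m in zip(token_emb, mask_emb)]


# Toy 3-dimensional embeddings (illustrative values only).
mask_emb = [0.0, 0.0, 1.0]
token_emb = [1.0, 2.0, 0.0]

fully_masked = soft_state(token_emb, mask_emb, 0.0)  # equals mask_emb
halfway = soft_state(token_emb, mask_emb, 0.5)       # midpoint of the two
committed = soft_state(token_emb, mask_emb, 1.0)     # equals token_emb
```

Under this reading, a position is never forced into a hard mask-or-token choice: an intermediate state keeps part of the mask embedding, which leaves room for a later step to revise the prediction.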