DiRL: An Efficient Post-Training Framework for Diffusion Language Models
December 23, 2025
Authors: Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyang He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
cs.AI
Abstract
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and improved their inference speed, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
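For context on the RL stage, the sketch below shows the group-relative advantage normalization at the core of standard GRPO, which DiPO builds on. The abstract does not detail DiPO's unbiasedness correction for dLLMs, so the function name, tensor shapes, and usage here are illustrative assumptions rather than the paper's implementation.

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standard GRPO-style advantages: each completion's reward is normalized
    by the mean and standard deviation of its prompt's sampling group.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
             completion sampled for each prompt during the online RL loop.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Illustrative usage: 4 prompts with 8 sampled completions each.
rewards = torch.rand(4, 8)
advantages = group_relative_advantages(rewards)
```

In DiPO, per the abstract, these group-relative advantages would be paired with an unbiased diffusion-specific policy objective rather than AR token log-probabilities; that component is not shown here.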