
DiRL: An Efficient Post-Training Framework for Diffusion Language Models

December 23, 2025
Authors: Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyang He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
cs.AI

Abstract

Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
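The blockwise training that the abstract says DiRL accelerates with FlexAttention can be pictured as a block-structured attention pattern: tokens attend freely within their own block and causally across earlier blocks. The sketch below builds such a mask with PyTorch's FlexAttention API; the block size, tensor shapes, and the exact masking rule are illustrative assumptions, not DiRL's actual training configuration.

```python
# Minimal sketch of a block-causal mask with PyTorch FlexAttention (torch >= 2.5).
# The block size and masking rule are assumptions for illustration only,
# not DiRL's actual blockwise training setup.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

BLOCK = 32  # hypothetical diffusion block size

def block_causal(b, h, q_idx, kv_idx):
    # Full attention inside a block, causal attention across blocks:
    # a query may attend to any key whose block index is <= its own.
    return (kv_idx // BLOCK) <= (q_idx // BLOCK)

device = "cuda" if torch.cuda.is_available() else "cpu"
B, H, S, D = 1, 8, 256, 64
mask = create_block_mask(block_causal, B=B, H=H, Q_LEN=S, KV_LEN=S, device=device)

q, k, v = (torch.randn(B, H, S, D, device=device) for _ in range(3))
# In practice flex_attention is usually wrapped in torch.compile for speed;
# eager execution is enough for this illustration.
out = flex_attention(q, k, v, block_mask=mask)  # (B, H, S, D)
```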
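DiPO is described as an unbiased GRPO variant for dLLMs. As background, the core of any GRPO-style objective is a group-relative advantage: several completions are sampled per prompt, scored, and each reward is normalized against its own group. The snippet below is a generic sketch of that step only, with made-up reward values; it is not the paper's DiPO update.

```python
# Generic sketch of the group-relative advantage used by GRPO-style objectives.
# Background for the abstract, not the paper's DiPO algorithm; rewards are hypothetical.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one row per prompt's group."""
    mean = rewards.mean(dim=-1, keepdim=True)   # per-group baseline
    std = rewards.std(dim=-1, keepdim=True)     # per-group scale
    return (rewards - mean) / (std + eps)       # how each sample compares to its group

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(rewards))
```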