Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
September 8, 2025
Authors: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
cs.AI
Abstract
We propose TraceRL, a trajectory-aware reinforcement learning framework for
diffusion language models (DLMs) that incorporates preferred inference
trajectories into post-training and is applicable across different
architectures. Equipped with a diffusion-based value model that enhances
training stability, TraceRL delivers improved reasoning performance on complex
math and coding tasks. In addition, it can be applied to adapt block-specific
models to larger blocks, which improves sampling flexibility. Employing
TraceRL, we derive a series of state-of-the-art diffusion language models,
namely TraDo. Although smaller than 7B-scale autoregressive (AR) models, TraDo-4B-Instruct still
consistently outperforms them across complex math reasoning tasks.
TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over
Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical
reasoning benchmarks. Through curriculum learning, we also derive the first
long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1%
relative accuracy gain. To facilitate reproducible research and practical
applications, we release a comprehensive open-source framework for building,
training, and deploying diffusion LLMs across diverse architectures. The
framework integrates accelerated KV-cache techniques and inference engines for
both inference and reinforcement learning, and includes implementations of
various supervised fine-tuning and RL methods for mathematics, coding, and
general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
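To make the idea more concrete, below is a minimal, self-contained sketch of what trajectory-aware RL with a diffusion-based value baseline can look like for a masked diffusion language model: it samples an unmasking trajectory, scores the completed sequence with a reward, and applies a REINFORCE-style update with a learned value baseline evaluated on intermediate diffusion states. Everything here (ToyDenoiser, ToyValue, reward_fn, the left-to-right unmasking schedule, all sizes) is an illustrative assumption, not the TraceRL/TraDo implementation released in the linked repository.

```python
# Toy, self-contained sketch; NOT the TraceRL implementation from the paper.
import torch
import torch.nn as nn

VOCAB, MASK, SEQ_LEN, STEPS = 16, 0, 8, 4   # tiny illustrative sizes

class ToyDenoiser(nn.Module):
    """Policy: predicts token logits at every position of a partially masked sequence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, VOCAB))

    def forward(self, x):                      # x: (B, L) token ids, MASK where undecided
        return self.body(self.emb(x))          # (B, L, VOCAB)

class ToyValue(nn.Module):
    """Diffusion-state value model: scores a partially unmasked sequence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.emb(x).mean(dim=1)).squeeze(-1)   # (B,)

def reward_fn(tokens):
    """Synthetic stand-in for a task reward (e.g., answer correctness)."""
    return (tokens == 3).float().mean(dim=1)

policy, value = ToyDenoiser(), ToyValue()
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=1e-3)
per_step = SEQ_LEN // STEPS

for _ in range(200):
    x = torch.full((4, SEQ_LEN), MASK)              # start from a fully masked sequence
    logp_sum, values = 0.0, []
    for step in range(STEPS):                       # sample a denoising (unmasking) trajectory
        values.append(value(x))                     # value estimate of the intermediate state
        logits = policy(x)
        pos = torch.arange(step * per_step, (step + 1) * per_step)  # unmask left to right
        dist = torch.distributions.Categorical(logits=logits[:, pos, :])
        sampled = dist.sample()                     # (B, per_step) tokens committed this step
        logp_sum = logp_sum + dist.log_prob(sampled).sum(dim=1)
        x = x.clone()
        x[:, pos] = sampled
    reward = reward_fn(x)                           # terminal reward on the completed sample
    advantage = (reward - values[0]).detach()       # baseline from the diffusion value model
    policy_loss = -(advantage * logp_sum).mean()    # trajectory-level policy gradient
    value_loss = torch.stack([(v - reward) ** 2 for v in values]).mean()
    opt.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    opt.step()
```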