Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models

September 8, 2025
作者: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
cs.AI

Abstract

We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectories into post-training and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Moreover, it can be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
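
The abstract describes TraceRL only at a high level, so as a rough illustration of what a trajectory-aware policy-gradient update with a diffusion-based value baseline could look like, here is a minimal PyTorch sketch. All identifiers (`trace_rl_step`, `policy`, `value_model`, `trajectory`) and shapes are hypothetical assumptions for exposition, not the dLLM-RL API or the paper's exact objective.

```python
# Hypothetical sketch: trajectory-aware policy gradient for a diffusion LM.
# Names and tensor shapes are illustrative assumptions, not the dLLM-RL API.
import torch
import torch.nn.functional as F

def trace_rl_step(policy, value_model, trajectory, reward, optimizer):
    """One update over a recorded denoising trajectory.

    trajectory : list of (state, actions) pairs, where `state` is the
        partially masked sequence at a denoising step and `actions` holds
        the token ids the sampler committed at that step (logits and
        actions assumed aligned per position).
    reward     : scalar tensor, e.g. a verifier score for the final answer.
    """
    step_logps, step_values = [], []
    for state, actions in trajectory:
        logits = policy(state)                    # [num_positions, vocab]
        logp = F.log_softmax(logits, dim=-1)
        # joint log-prob of the tokens committed at this denoising step
        step_logps.append(logp.gather(-1, actions.unsqueeze(-1)).sum())
        step_values.append(value_model(state).squeeze())  # diffusion critic

    logps = torch.stack(step_logps)               # [num_steps]
    values = torch.stack(step_values)             # [num_steps]
    advantages = (reward - values).detach()       # per-step value baseline

    policy_loss = -(advantages * logps).mean()
    value_loss = F.mse_loss(values, reward.expand_as(values))
    loss = policy_loss + 0.5 * value_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is the credit-assignment structure the abstract names: rather than scoring a token-factorized likelihood as in AR post-training, the objective follows the model's actual denoising trajectory step by step, with the value model providing a baseline that stabilizes the step-level advantages.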