Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
September 8, 2025
Authors: Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
cs.AI
Abstract
We propose TraceRL, a trajectory-aware reinforcement learning framework for
diffusion language models (DLMs) that incorporates preferred inference
trajectories into post-training and is applicable across different
architectures. Equipped with a diffusion-based value model that enhances
training stability, TraceRL delivers improved reasoning performance on complex
math and coding tasks. In addition, it can be applied to adapt block-specific
models to larger blocks, which improves sampling flexibility. Employing
TraceRL, we derive a series of state-of-the-art diffusion language models,
namely TraDo. Although smaller than 7B-scale autoregressive (AR) models, TraDo-4B-Instruct still
consistently outperforms them across complex math reasoning tasks.
TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over
Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical
reasoning benchmarks. Through curriculum learning, we also derive the first
long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1%
relative accuracy gain. To facilitate reproducible research and practical
applications, we release a comprehensive open-source framework for building,
training, and deploying diffusion LLMs across diverse architectures. The
framework integrates accelerated KV-cache techniques and inference engines for
both inference and reinforcement learning, and includes implementations of
various supervised fine-tuning and RL methods for mathematics, coding, and
general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
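To make the idea more concrete, below is a minimal, self-contained sketch of what trajectory-aware RL with a diffusion-based value baseline can look like for a masked diffusion language model: it samples an unmasking trajectory, scores the completed sequence with a reward, and applies a REINFORCE-style update with a learned value baseline evaluated on intermediate diffusion states. Everything here (ToyDenoiser, ToyValue, reward_fn, the left-to-right unmasking schedule, all sizes) is an illustrative assumption, not the TraceRL/TraDo implementation released in the linked repository.

```python
# Toy, self-contained sketch; NOT the TraceRL implementation from the paper.
import torch
import torch.nn as nn

VOCAB, MASK, SEQ_LEN, STEPS = 16, 0, 8, 4   # tiny illustrative sizes

class ToyDenoiser(nn.Module):
    """Policy: predicts token logits at every position of a partially masked sequence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, VOCAB))

    def forward(self, x):                      # x: (B, L) token ids, MASK where undecided
        return self.body(self.emb(x))          # (B, L, VOCAB)

class ToyValue(nn.Module):
    """Diffusion-state value model: scores a partially unmasked sequence."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, 32)
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.emb(x).mean(dim=1)).squeeze(-1)   # (B,)

def reward_fn(tokens):
    """Synthetic stand-in for a task reward (e.g., answer correctness)."""
    return (tokens == 3).float().mean(dim=1)

policy, value = ToyDenoiser(), ToyValue()
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=1e-3)
per_step = SEQ_LEN // STEPS

for _ in range(200):
    x = torch.full((4, SEQ_LEN), MASK)              # start from a fully masked sequence
    logp_sum, values = 0.0, []
    for step in range(STEPS):                       # sample a denoising (unmasking) trajectory
        values.append(value(x))                     # value estimate of the intermediate state
        logits = policy(x)
        pos = torch.arange(step * per_step, (step + 1) * per_step)  # unmask left to right
        dist = torch.distributions.Categorical(logits=logits[:, pos, :])
        sampled = dist.sample()                     # (B, per_step) tokens committed this step
        logp_sum = logp_sum + dist.log_prob(sampled).sum(dim=1)
        x = x.clone()
        x[:, pos] = sampled
    reward = reward_fn(x)                           # terminal reward on the completed sample
    advantage = (reward - values[0]).detach()       # baseline from the diffusion value model
    policy_loss = -(advantage * logp_sum).mean()    # trajectory-level policy gradient
    value_loss = torch.stack([(v - reward) ** 2 for v in values]).mean()
    opt.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    opt.step()
```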