

DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

April 5, 2026
作者: Jingyi Yang, Yuxian Jiang, Xuhao Hu, Shuang Cheng, Biqing Qi, Jing Shao
cs.AI

Abstract

Diffusion large language models (dLLMs) are emerging as a compelling alternative to dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations, and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl (Sheng et al., 2024) and OpenCompass (2023), DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results show that DARE serves as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
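The iterative denoising and parallel generation that the abstract contrasts with autoregressive decoding can be illustrated with a toy loop: start from a fully masked sequence and, at each denoising step, commit the most confident token predictions in parallel. This is a minimal sketch of the general masked-diffusion decoding idea, not DARE's actual API; the `scorer` stands in for a dLLM forward pass, and all names here are illustrative.

```python
import random

MASK = "[MASK]"

def masked_diffusion_decode(length, scorer, steps):
    """Toy masked-diffusion decoding: begin fully masked and, on each
    denoising step, unmask the highest-confidence positions in parallel."""
    seq = [MASK] * length
    per_step = max(1, length // steps)  # positions revealed per step
    while MASK in seq:
        # Score every still-masked position; scorer returns (token, confidence).
        proposals = {i: scorer(seq, i)
                     for i, tok in enumerate(seq) if tok == MASK}
        # Commit the top-k most confident predictions simultaneously.
        top = sorted(proposals, key=lambda i: proposals[i][1],
                     reverse=True)[:per_step]
        for i in top:
            seq[i] = proposals[i][0]
    return seq

# Hypothetical scorer standing in for a real dLLM forward pass.
def toy_scorer(seq, i):
    return (f"tok{i}", random.random())

random.seed(0)
out = masked_diffusion_decode(8, toy_scorer, steps=4)
print(out)  # all 8 positions filled after 4 parallel denoising steps
```

Because each step fills several positions at once, the number of model calls scales with the step count rather than the sequence length, which is the source of the parallelism (and the "practical acceleration") the abstract refers to.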