DARE: Diffusion Large Language Models Alignment and Reinforcement Executor

Abstract

Diffusion large language models (dLLMs) are emerging as a compelling alternative to the dominant autoregressive models, replacing strictly sequential token generation with iterative denoising and parallel generation dynamics. However, their open-source ecosystem remains fragmented across model families and, in particular, across post-training pipelines, where reinforcement learning objectives, rollout implementations, and evaluation scripts are often released as paper-specific codebases. This fragmentation slows research iteration, raises the engineering burden of reproduction, and makes fair comparison across algorithms difficult. We present DARE (dLLMs Alignment and Reinforcement Executor), an open framework for post-training and evaluating dLLMs. Built on top of verl [sheng2024hybridflow] and OpenCompass [2023opencompass], DARE unifies supervised fine-tuning, parameter-efficient fine-tuning, preference optimization, and dLLM-specific reinforcement learning under a shared execution stack for both masked and block diffusion language models. Across representative model families including LLaDA, Dream, SDAR, and LLaDA2.x, DARE provides broad algorithmic coverage, reproducible benchmark evaluation, and practical acceleration. Extensive empirical results show that DARE serves as a reusable research substrate for developing, comparing, and deploying post-training methods for current and emerging dLLMs.
