Stabilizing Reinforcement Learning for Diffusion Language Models

Abstract

Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.
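
The contrast between conditional and unconditional clipping, and between fixed group-size normalization and self-normalization, can be illustrated with a small NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the clipping threshold `eps`, and the choice to model each scheme as a per-sample weight on the term A_i * grad log pi_i. It is one plausible reading of the described mechanisms, not the authors' implementation.

```python
import numpy as np


def grpo_weights(ratios, advantages, eps=0.2):
    """Per-sample gradient weights under PPO/GRPO-style conditional clipping.

    Assumed surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged
    over a fixed group size G. The gradient of sample i is weight_i * A_i *
    grad log pi_i, where the weight is r_i / G if the unclipped branch is
    selected and 0 otherwise.
    """
    r = np.asarray(ratios, dtype=float)
    a = np.asarray(advantages, dtype=float)
    # The clip only applies on one side, depending on the sign of A:
    # A >= 0 is cut off above 1 + eps, A < 0 is cut off below 1 - eps.
    # A large ratio paired with a negative advantage is therefore NOT clipped.
    active = np.where(a >= 0, r <= 1 + eps, r >= 1 - eps)
    return np.where(active, r, 0.0) / r.size


def self_normalized_weights(ratios, eps=0.2):
    """Hypothetical stabilized weights: unconditional clip, then self-normalize.

    Every ratio is clipped regardless of the advantage sign, and the weights
    are divided by their own sum instead of by G. They are non-negative and
    sum to 1, so the update is a convex combination of per-sample terms.
    """
    r = np.asarray(ratios, dtype=float)
    clipped = np.clip(r, 1 - eps, 1 + eps)
    return clipped / clipped.sum()


# Toy group of four samples; the last ratio is a noisy outlier estimate.
ratios = [1.0, 0.9, 1.1, 50.0]
advantages = [1.0, -0.5, 0.5, -1.0]

print("conditional clip, 1/G:", grpo_weights(ratios, advantages))
print("unconditional clip, self-normalized:", self_normalized_weights(ratios))
```

In this toy group the outlier ratio is paired with a negative advantage, so the conditional clip leaves it untouched and its weight dominates the update. Under the sketched alternative, no single weight can exceed (1 + eps) / (G * (1 - eps)), whatever the estimated ratios are.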
