Stabilizing Reinforcement Learning for Diffusion Language Models
March 6, 2026
Authors: Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu
cs.AI
Abstract
Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.
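To make the two modifications concrete, the snippet below gives a minimal toy sketch in NumPy. It is not the paper's actual objective: it assumes a standard PPO-style clipped surrogate for GRPO, one plausible reading of "unconditional clipping" (the estimated ratio is always clipped, rather than only when the min binds), and one plausible reading of "self-normalization" (clipped ratios renormalized to sum to one within the group). The function names `grpo_standard` and `stabledrl_like`, the clipping range, and all numbers are hypothetical.

```python
import numpy as np

def grpo_standard(ratios, advantages, eps=0.2):
    """Standard GRPO-style surrogate for one group: PPO-style conditional
    clipping (the min only binds in one direction), averaged over a fixed
    group size G."""
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    per_sample = np.minimum(ratios * advantages, clipped * advantages)
    return per_sample.mean()

def stabledrl_like(ratios, advantages, eps=0.2):
    """Hypothetical sketch of the two changes named in the abstract:
    (i) unconditional clipping: the estimated ratio is always clipped, so a
        noisy outlier cannot slip past the trust region and spike its term;
    (ii) self-normalization: clipped ratios are renormalized to sum to one,
        so the group update is a convex combination of per-sample terms."""
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    weights = clipped / clipped.sum()   # nonnegative, sum to 1
    return np.sum(weights * advantages)

# Toy group of G = 4 rollouts; the third ratio estimate is a noisy outlier
# paired with a negative advantage, the case where standard conditional
# clipping does not bind and the outlier dominates the group average.
ratios = np.array([0.9, 1.1, 7.5, 1.0])
advantages = np.array([0.5, -0.3, -0.6, 0.2])
print("standard GRPO surrogate :", grpo_standard(ratios, advantages))
print("sketched StableDRL value:", stabledrl_like(ratios, advantages))
```

In this toy group the standard surrogate is dominated by the single noisy ratio (the unclipped term contributes -4.5 out of a group mean of about -1.05), whereas the self-normalized variant stays bounded (about -0.10), illustrating, under the above assumptions, how the two changes are meant to suppress outlier-driven spikes.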