

Stabilizing Reinforcement Learning for Diffusion Language Models

March 6, 2026
Authors: Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu
cs.AI

Abstract

Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion large language models (dLLMs) often triggers reward collapse. We identify two sources of incompatibility. First, GRPO relies on importance ratios defined by sequence probabilities, which are intractable in dLLMs and must be estimated (e.g., via ELBO-based or mean-field likelihood proxies), yielding inherently noisy ratios. Second, standard GRPO's formulation is not designed for estimated ratios: its conditional clipping can be anomalously bypassed by model-agnostic estimation noise, producing gradient spikes, while its fixed group-size normalization amplifies gradient-magnitude fluctuations under high-variance ratio estimates. We show these effects form a self-reinforcing instability loop that drives policy drift and further increases ratio variance. To break this loop, we propose StableDRL, a reformulation of GRPO tailored for dLLMs that uses (i) unconditional clipping to suppress outlier-induced spikes and (ii) self-normalization to constrain updates within the convex hull of per-sample gradients. We further extend StableDRL to block-wise diffusion models via a staircase attention mechanism.
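To make the two modifications described above concrete, the sketch below contrasts a standard conditionally clipped GRPO surrogate with an unconditionally clipped, self-normalized variant. This is a minimal illustration under assumptions, not the paper's implementation: the function names, the sample-level (rather than token-level) ratios, and the exact weighting scheme are placeholders chosen only to show the mechanism.

```python
import torch


def grpo_surrogate(ratios, advantages, eps=0.2):
    """Standard GRPO-style surrogate (to be maximized) with conditional clipping.

    `ratios` are per-sample importance ratios (estimated, hence noisy, for dLLMs);
    `advantages` are group-relative advantages.
    """
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps)
    # Conditional (PPO-style) clipping: the clip only binds when it lowers the
    # objective, so a noisy ratio estimate can bypass it and pass a large
    # per-sample term straight into the gradient.
    per_sample = torch.minimum(ratios * advantages, clipped * advantages)
    # Fixed group-size normalization: the step magnitude still scales with the
    # variance of the estimated ratios.
    return per_sample.sum() / ratios.numel()


def stabledrl_like_surrogate(ratios, advantages, eps=0.2):
    """Hypothetical sketch of the two changes named in the abstract.

    (i) Unconditional clipping: always clamp the estimated ratio, regardless of
        the sign of the advantage, so an outlier estimate cannot inject an
        arbitrarily large per-sample term.
    (ii) Self-normalization: divide by the sum of clipped ratios instead of a
        fixed group size, so the update is a convex combination of per-sample
        contributions.
    """
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps)
    weights = clipped / clipped.sum()  # nonnegative weights summing to one
    return (weights * advantages).sum()
```

In this sketch the self-normalized weights are nonnegative and sum to one, so the resulting update stays inside the convex hull of the per-sample terms and its magnitude no longer grows with the variance of the estimated ratios; how StableDRL realizes this at the token or block level is detailed in the paper itself.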