SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
October 10, 2025
Authors: Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
cs.AI
Abstract
Diffusion large language models (dLLMs) are emerging as an efficient
alternative to autoregressive models due to their ability to decode multiple
tokens in parallel. However, aligning dLLMs with human preferences or
task-specific rewards via reinforcement learning (RL) is challenging because
their intractable log-likelihood precludes the direct application of standard
policy gradient methods. While prior work uses surrogates like the evidence
lower bound (ELBO), these one-sided approximations can introduce significant
policy gradient bias. To address this, we propose the Sandwiched Policy
Gradient (SPG) that leverages both an upper and a lower bound of the true
log-likelihood. Experiments show that SPG significantly outperforms baselines
based on ELBO or one-step estimation. Specifically, SPG improves the accuracy
over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500,
18.4% in Countdown and 27.0% in Sudoku.
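
As a rough sketch of the sandwiching idea stated in the abstract (the exact bounds and weighting used by SPG are not given here; the mixing weight $w$ and the specific choice of bounds are assumptions for illustration), the intractable log-likelihood $\log \pi_\theta(y \mid x)$ is bracketed between a lower bound $\mathcal{L}_{\mathrm{LB}}$ (e.g., an ELBO) and an upper bound $\mathcal{L}_{\mathrm{UB}}$, and a combination of the two can serve as the surrogate in a REINFORCE-style update:

$$
\mathcal{L}_{\mathrm{LB}}(\theta) \;\le\; \log \pi_\theta(y \mid x) \;\le\; \mathcal{L}_{\mathrm{UB}}(\theta),
\qquad
\nabla_\theta J(\theta) \;\approx\; \mathbb{E}_{x,\; y \sim \pi_\theta}\!\left[ R(x, y)\, \nabla_\theta\!\left( w\, \mathcal{L}_{\mathrm{LB}}(\theta) + (1 - w)\, \mathcal{L}_{\mathrm{UB}}(\theta) \right) \right],
\quad w \in [0, 1].
$$

Relying on $\mathcal{L}_{\mathrm{LB}}$ alone (the ELBO surrogate used in prior work) gives a one-sided approximation; incorporating the upper bound is what the abstract describes as reducing the resulting policy gradient bias.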