SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

October 10, 2025
作者: Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
cs.AI

Abstract

Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.
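As a rough illustration of the sandwiching idea (not the paper's exact formulation), one way to combine the two bounds is to back-propagate through the lower bound for positively rewarded samples and through the upper bound for negatively rewarded ones, so the surrogate never overstates the advantage-weighted log-likelihood. The PyTorch sketch below uses hypothetical names (`spg_surrogate_loss`, `log_p_lower`, `log_p_upper`); how SPG actually estimates, weights, and mixes the bounds is specified in the paper itself.

```python
import torch

def spg_surrogate_loss(log_p_lower: torch.Tensor,
                       log_p_upper: torch.Tensor,
                       rewards: torch.Tensor) -> torch.Tensor:
    """Hypothetical sandwiched policy-gradient surrogate (illustration only).

    log_p_lower: per-sample lower bounds on log p_theta(x), e.g. an ELBO, shape (B,)
    log_p_upper: per-sample upper bounds on log p_theta(x), shape (B,)
    rewards:     per-sample task rewards, shape (B,)
    """
    # Baseline-subtracted rewards decide the sign of each sample's update.
    advantages = rewards - rewards.mean()

    # Positive advantage: raising the *lower* bound is guaranteed to raise the
    # true log-likelihood. Negative advantage: lowering the *upper* bound is
    # guaranteed to lower it. Selecting the bound by sign keeps the surrogate
    # on the conservative side of the intractable objective.
    bound = torch.where(advantages >= 0, log_p_lower, log_p_upper)

    # REINFORCE-style loss: negate the advantage-weighted log-likelihood bound.
    return -(advantages.detach() * bound).mean()
```

In a masked diffusion LLM the two bounds would themselves be Monte Carlo estimates over masking timesteps rather than quantities handed to the loss directly; the sketch only captures how an upper and a lower bound can be sandwiched around the intractable log-likelihood inside a policy-gradient update.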