GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
May 28, 2026
Authors: Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic
cs.AI
Abstract
Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although these approaches are well aligned with pre-training, using the ELBO as a likelihood surrogate introduces bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances of the same framework under different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
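The "closed-form optimum of reverse-KL regularized RL" mentioned above refers to a standard result: maximizing E_pi[A] - beta * KL(pi || pi_ref) over distributions pi gives pi*(y) proportional to pi_ref(y) * exp(A(y) / beta). The sketch below checks this on a toy three-outcome categorical distribution. It is only an illustration of that textbook result, not the GDSD objective, which the abstract does not spell out; the function names, the toy logits and advantages, and the value of beta are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def guided_teacher_logits(ref_logits, advantages, beta):
    """Unnormalized teacher logits: reference logits shifted by advantage / beta.

    Hypothetical helper for illustration; not taken from the GDSD code base.
    """
    return [l + a / beta for l, a in zip(ref_logits, advantages)]

def kl_regularized_objective(policy, ref, advantages, beta):
    """E_policy[A] - beta * KL(policy || ref) for categorical distributions."""
    expected_adv = sum(p * a for p, a in zip(policy, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref) if p > 0)
    return expected_adv - beta * kl

# Toy values, chosen arbitrarily for the example.
ref_logits = [0.0, 1.0, -1.0]
advantages = [1.0, -0.5, 0.2]
beta = 0.5

ref = softmax(ref_logits)
teacher_logits = guided_teacher_logits(ref_logits, advantages, beta)
teacher = softmax(teacher_logits)

# The teacher attains the maximum of the regularized objective.
best = kl_regularized_objective(teacher, ref, advantages, beta)

# Adding a constant to every logit leaves the teacher distribution unchanged,
# so the teacher is fully described by logits known only up to a shift.
shifted_teacher = softmax([l + 3.0 for l in teacher_logits])
```

Because the teacher's logits are just the reference logits plus A / beta, they can be written down without computing the normalizing constant. This is one way to read the abstract's "normalization-free" logit matching, though the exact loss used by GDSD should be taken from the paper and its repository.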