GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

May 28, 2026
Authors: Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic
cs.AI

Abstract

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although these approaches are well aligned with pre-training, using the ELBO as a likelihood surrogate introduces bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances that apply different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
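
The abstract refers to the closed-form optimum of reverse-KL regularized RL without stating it. For a generic objective of the form max over pi of E_pi[A] - beta * KL(pi || pi_ref), the well-known optimum is pi*(y) proportional to pi_ref(y) * exp(A(y) / beta). The sketch below illustrates that standard result on a toy three-outcome categorical distribution. It is not the GDSD objective, whose exact form the abstract does not give, and the names `tilted_teacher`, `objective`, `ref`, `adv`, and `beta` are hypothetical choices made for illustration only.

```python
import math


def tilted_teacher(ref_probs, advantages, beta):
    """Return pi*(y) proportional to ref(y) * exp(A(y) / beta).

    This is the standard closed-form maximizer of
    E_pi[A] - beta * KL(pi || ref) over categorical distributions.
    It is a toy illustration, not the GDSD training objective.
    """
    # Work in log space and subtract the max for numerical stability.
    log_w = [math.log(p) + a / beta for p, a in zip(ref_probs, advantages)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def objective(pi, ref_probs, advantages, beta):
    """Expected advantage minus beta times the reverse KL to the reference."""
    expected_adv = sum(p * a for p, a in zip(pi, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref_probs) if p > 0)
    return expected_adv - beta * kl


# Hypothetical toy inputs: a reference policy over three outcomes
# and a made-up advantage for each outcome.
ref = [0.5, 0.3, 0.2]
adv = [-1.0, 0.0, 2.0]
beta = 1.0

teacher = tilted_teacher(ref, adv, beta)
```

In this toy setting the teacher moves probability mass toward the high-advantage outcome, and its objective value equals beta times the log of the normalizer, sum over y of ref(y) * exp(A(y) / beta). A smaller `beta` weakens the KL penalty and makes the teacher more concentrated on the highest-advantage outcome.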