GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Abstract

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate and thereby introduce bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods can be viewed as instances that apply different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods and shows more stable training reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
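
To make the "closed-form optimum of reverse-KL regularized RL" concrete, the sketch below works through the standard textbook result on a toy categorical distribution: the maximizer of E_p[A] - beta * KL(p || p_ref) is p*(y) proportional to p_ref(y) * exp(A(y) / beta), which at the logit level is simply the reference logits shifted by A / beta. This is an illustration under stated assumptions, not the GDSD objective itself: the abstract does not give the exact loss, and the names `guided_teacher`, `ref_logits`, `advantages`, and `beta`, as well as the toy values, are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two categorical distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(p, ref, adv, beta):
    """Reverse-KL regularized RL objective: E_p[A] - beta * KL(p || ref)."""
    expected_adv = sum(pi * a for pi, a in zip(p, adv))
    return expected_adv - beta * kl(p, ref)


def guided_teacher(ref_logits, adv, beta):
    """Closed-form maximizer of the objective (hypothetical helper name).

    The teacher logits are the reference logits plus adv / beta, so the
    normalizer never has to be computed to write the teacher down.
    """
    return softmax([z + a / beta for z, a in zip(ref_logits, adv)])


# Toy values, chosen only for illustration.
ref_logits = [0.2, -1.0, 0.5, 0.0]
advantages = [1.0, -0.5, 0.0, 2.0]
beta = 0.5

ref = softmax(ref_logits)
teacher = guided_teacher(ref_logits, advantages, beta)

# The optimal value has the closed form beta * log sum_y ref(y) * exp(A(y) / beta).
best = objective(teacher, ref, advantages, beta)
closed_form_value = beta * math.log(
    sum(r * math.exp(a / beta) for r, a in zip(ref, advantages))
)
```

Here `best` and `closed_form_value` agree up to floating-point error, and no other distribution over the four outcomes scores higher under `objective`. In a self-distillation setting the reference would presumably be the current denoiser itself, but how GDSD defines the teacher per token and which divergence it uses to match logits are details of the paper that this sketch does not attempt to reproduce.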