GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Abstract

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate and thereby introduce bias through training-inference mismatch (TIM), which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of a dLLM from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL-regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances that apply different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training-reward dynamics, achieving test-accuracy improvements of up to +19.6%. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
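
For intuition about the "closed-form optimum of reverse-KL regularized RL" that the abstract refers to, the following toy sketch may help. It is not the GDSD implementation and makes no claim about how the paper applies the result to denoisers. It only illustrates the standard fact that, for a single categorical distribution, the maximizer of E_pi[A] - beta * KL(pi || pi_ref) is proportional to pi_ref * exp(A / beta), so in logit space the target is the reference logits shifted by A / beta, with no normalizing constant required. The four-token vocabulary, the value of `beta`, the logits, the advantages, and all function names are illustrative assumptions.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def kl_regularized_objective(pi, pi_ref, advantages, beta):
    """E_pi[A] - beta * KL(pi || pi_ref) for categorical distributions."""
    kl = np.sum(pi * (np.log(pi) - np.log(pi_ref)))
    return float(np.dot(pi, advantages) - beta * kl)


def teacher_logits(ref_logits, advantages, beta):
    """Logits of the closed-form optimum, defined up to an additive constant.

    Hypothetical helper: the optimum is pi_ref * exp(A / beta) / Z, and the
    log-normalizer Z only shifts every logit by the same amount.
    """
    return ref_logits + advantages / beta


# Toy values (assumptions, not taken from the paper).
ref_logits = np.array([1.0, 0.5, -0.3, 0.0])
advantages = np.array([0.8, -0.2, 0.1, -0.7])
beta = 0.5

pi_ref = softmax(ref_logits)
pi_star = softmax(teacher_logits(ref_logits, advantages, beta))
best = kl_regularized_objective(pi_star, pi_ref, advantages, beta)

# Sanity check: no randomly drawn distribution should beat the closed form.
rng = np.random.default_rng(0)
candidates = [softmax(rng.normal(size=4)) for _ in range(1000)]
max_other = max(
    kl_regularized_objective(p, pi_ref, advantages, beta) for p in candidates
)

print("objective at closed-form optimum:", best)
print("best objective among random candidates:", max_other)
```

Because the teacher is specified by logits up to an additive constant, a student can in principle be matched to it without ever evaluating a normalized likelihood, which is the flavor of "normalization-free" matching the abstract describes.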