EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models
February 4, 2026
Authors: Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi
cs.AI
Abstract
Reward guidance has been applied with great success to the test-time adaptation of continuous diffusion models; it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both of these methods. The former degrades gradient feedback because the reward model has never been trained on continuous inputs. The latter involves incorrect optimization because a gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism, EntRGi (Entropy aware Reward Guidance), that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
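The abstract describes modulating a continuous relaxation of the tokens by the model's confidence before passing it to the reward model. The sketch below illustrates one plausible reading of that idea: per-position entropy of the token distribution decides how close the relaxation stays to a hard one-hot token. The function name, the normalized-entropy weighting, and the linear blend are all assumptions for illustration, not the paper's actual rule.

```python
import numpy as np

def entropy_aware_relaxation(logits, tau=1.0):
    """Hypothetical sketch (not the paper's exact method): blend a softmax
    relaxation of the logits with hard one-hot tokens, weighted by confidence.
    Confident (low-entropy) positions stay near the discrete one-hot inputs the
    reward model was trained on; uncertain (high-entropy) positions keep a
    softer relaxation through which gradients remain informative."""
    # softmax over the vocabulary (last axis), with a temperature tau
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)

    # normalized entropy in [0, 1] per sequence position
    V = p.shape[-1]
    H = -(p * np.log(p + 1e-12)).sum(axis=-1) / np.log(V)

    # hard one-hot vector at the argmax token
    hard = np.eye(V)[p.argmax(axis=-1)]

    # confidence-weighted blend: w -> 1 gives hard tokens, w -> 0 gives softmax
    w = (1.0 - H)[..., None]
    return w * hard + (1.0 - w) * p

# two positions over a toy 3-token vocabulary: one confident, one uncertain
x = entropy_aware_relaxation(np.array([[5.0, 0.1, 0.1],
                                       [0.2, 0.1, 0.15]]))
```

In this sketch the confident first position collapses almost entirely onto its argmax token, while the near-uniform second position passes through nearly unchanged, so the reward model receives near-discrete inputs exactly where the generator is already committed.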