EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

February 4, 2026
Authors: Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi
cs.AI

Abstract

Reward guidance has been applied with great success to the test-time adaptation of continuous diffusion models: it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the model's natural outputs because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations or employ techniques like the straight-through estimator. In this work, we show the downsides of both methods: the former degrades gradient feedback because the reward model has never been trained on continuous inputs, while the latter performs incorrect optimization because gradients evaluated at discrete tokens are used to update continuous logits. Our key innovation is to go beyond this tradeoff with a novel mechanism called EntRGi (Entropy-aware Reward Guidance), which dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
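The abstract does not spell out EntRGi's exact blending rule, but the core idea of gating between a hard straight-through path and a soft relaxation using per-position confidence can be sketched. Below is a minimal, hypothetical PyTorch sketch of one guidance step: the names `entropy_guided_step`, `reward_model`, and `embedding` are assumptions, and the linear interpolation by confidence is an illustrative choice, not the paper's formula.

```python
import torch
import torch.nn.functional as F

def entropy_guided_step(logits, reward_model, embedding, step_size=0.1):
    """One hypothetical entropy-aware reward-guidance update.

    logits:       (batch, seq_len, vocab) continuous logits at a denoising step
    reward_model: callable mapping embedded token sequences to per-example rewards
    embedding:    (vocab, dim) token embedding table used by the reward model
    """
    logits = logits.detach().requires_grad_(True)
    probs = F.softmax(logits, dim=-1)  # continuous relaxation of the tokens

    # Per-position confidence: 1 at a fully peaked distribution, 0 at uniform.
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)      # (B, L)
    max_entropy = torch.log(torch.tensor(float(probs.size(-1))))
    confidence = (1.0 - entropy / max_entropy).unsqueeze(-1)           # (B, L, 1)

    # Hard one-hot tokens with a straight-through gradient path.
    hard = F.one_hot(probs.argmax(-1), probs.size(-1)).float()
    hard_st = hard + probs - probs.detach()

    # Entropy-aware mixture: confident positions look discrete (inputs the
    # reward model was trained on), uncertain positions stay soft (so the
    # reward gradient remains informative).
    mixed = confidence * hard_st + (1.0 - confidence) * probs

    reward = reward_model(mixed @ embedding).sum()
    reward.backward()
    return (logits + step_size * logits.grad).detach()  # ascend the reward
```

In this sketch, confident positions present the reward model with (near-)one-hot inputs of the kind it was trained on, while uncertain positions keep soft probabilities so that the gradient is not evaluated at a token the denoiser has not yet committed to; this is one plausible reading of how modulating the relaxation by confidence avoids both failure modes described above.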