Differential Information: An Information-Theoretic Perspective on Preference Optimization
May 29, 2025
Authors: Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
cs.AI
Abstract
Direct Preference Optimization (DPO) has become a standard technique for
aligning language models with human preferences in a supervised manner. Despite
its empirical success, the theoretical justification behind its log-ratio
reward parameterization remains incomplete. In this work, we address this gap
by utilizing the Differential Information Distribution (DID): a distribution
over token sequences that captures the information gained during policy
updates. First, we show that when preference labels encode the differential
information required to transform a reference policy into a target policy, the
log-ratio reward in DPO emerges as the uniquely optimal form for learning the
target policy via preference optimization. This result naturally yields a
closed-form expression for the optimal sampling distribution over rejected
responses. Second, we find that the condition for preferences to encode
differential information is fundamentally linked to an implicit assumption
regarding log-margin ordered policies, an inductive bias widely used in
preference optimization yet previously unrecognized. Finally, by analyzing the
entropy of the DID, we characterize how learning low-entropy differential
information reinforces the policy distribution, while high-entropy differential
information induces a smoothing effect, which explains the log-likelihood
displacement phenomenon. We validate our theoretical findings in synthetic
experiments and extend them to real-world instruction-following datasets. Our
results suggest that learning high-entropy differential information is crucial
for general instruction-following, while learning low-entropy differential
information benefits knowledge-intensive question answering. Overall, our work
presents a unifying perspective on the DPO objective, the structure of
preference data, and resulting policy behaviors through the lens of
differential information.
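For reference, the log-ratio reward parameterization discussed in the abstract is the standard DPO formulation: the reward is the scaled log-ratio between the trained policy and the reference policy, plugged into a Bradley-Terry preference loss. The notation below (policies $\pi_\theta$ and $\pi_{\mathrm{ref}}$, chosen and rejected responses $y_w$ and $y_l$, temperature $\beta$) is the usual DPO notation and is not taken from this page.

```latex
% Log-ratio reward parameterization used by DPO
r_\theta(x, y) \;=\; \beta \,\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% DPO objective under the Bradley-Terry preference model,
% with chosen response y_w and rejected response y_l
\mathcal{L}_{\mathrm{DPO}}(\theta)
  \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
```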
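As a toy illustration of the abstract's entropy analysis, the sketch below assumes, purely for illustration, that the Differential Information Distribution (DID) over a finite set of candidate responses is proportional to the policy ratio pi_target / pi_ref (so that the target policy is the reference policy reweighted by the DID); the paper's exact definition and normalization may differ. Under that assumption, a low-entropy DID concentrates probability mass and sharpens the target policy relative to the reference, while a high-entropy DID spreads mass and smooths it.

```python
import numpy as np

def did(pi_ref, pi_tgt):
    """Assumed DID: the update d such that pi_tgt is proportional to
    pi_ref * d, i.e. d proportional to pi_tgt / pi_ref, normalized."""
    d = pi_tgt / pi_ref
    return d / d.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy setting: four candidate responses with a fixed reference policy.
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])

# A sharpened target (mass concentrated) vs. a smoothed target (mass spread).
targets = {
    "sharpened": np.array([0.85, 0.07, 0.05, 0.03]),
    "smoothed":  np.array([0.30, 0.27, 0.23, 0.20]),
}

print(f"H(pi_ref) = {entropy(pi_ref):.3f} nats")
for name, pi_tgt in targets.items():
    d = did(pi_ref, pi_tgt)
    print(f"{name:>9}: H(DID) = {entropy(d):.3f}, H(pi_tgt) = {entropy(pi_tgt):.3f}")
```

Running this prints a low DID entropy for the sharpened target (whose entropy drops below the reference's) and a high DID entropy for the smoothed target (whose entropy rises above the reference's), mirroring the reinforcing and smoothing effects described in the abstract.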