Differential Information: An Information-Theoretic Perspective on Preference Optimization

May 29, 2025
Authors: Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
cs.AI

Abstract

Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.
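
For readers unfamiliar with the log-ratio reward the abstract refers to, the sketch below illustrates it on a toy, enumerable response set. The DPO reward and loss shown are the standard parameterization from the DPO literature; everything else (the toy probabilities, the beta value, and the ratio-based formalization of a differential-information-style distribution) is an illustrative assumption, not a definition taken from this paper.

```python
import math

# Toy "policies": probabilities over a tiny, enumerable set of responses.
# These numbers are illustrative placeholders, not data from the paper.
responses = ["y_a", "y_b", "y_c"]
pi_ref    = {"y_a": 0.50, "y_b": 0.30, "y_c": 0.20}   # reference policy
pi_theta  = {"y_a": 0.20, "y_b": 0.50, "y_c": 0.30}   # policy being trained
pi_target = {"y_a": 0.10, "y_b": 0.60, "y_c": 0.30}   # hypothetical target policy

beta = 0.1  # DPO temperature hyperparameter (assumed value)

def log_ratio_reward(y: str) -> float:
    """Standard DPO reward parameterization: beta * log(pi_theta(y) / pi_ref(y))."""
    return beta * (math.log(pi_theta[y]) - math.log(pi_ref[y]))

def dpo_loss(y_chosen: str, y_rejected: str) -> float:
    """DPO objective for one preference pair: -log sigmoid(r(y_w) - r(y_l))."""
    margin = log_ratio_reward(y_chosen) - log_ratio_reward(y_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# One plausible (assumed) formalization of a differential-information-style
# distribution: the ratio pi_target / pi_ref, renormalized over the response set.
# The paper's exact definition of the DID may differ.
ratio = {y: pi_target[y] / pi_ref[y] for y in responses}
z = sum(ratio.values())
did = {y: r / z for y, r in ratio.items()}

def entropy(dist: dict) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

if __name__ == "__main__":
    print("DPO loss (y_b preferred over y_a):", dpo_loss("y_b", "y_a"))
    print("Assumed DID:", did)
    print("Entropy of assumed DID (nats):", entropy(did))
```

In this toy setup, a low-entropy DID would concentrate its mass on a few sequences (reinforcing the policy distribution), whereas a high-entropy DID spreads mass broadly, matching the smoothing effect the abstract associates with log-likelihood displacement.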
