The Conditional Equivalence of DPO and RLHF: Implicit Assumptions, Failure Modes, and Provable Alignment

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with a simpler implementation. We prove that this equivalence is conditional rather than universal: it depends on an implicit assumption that is frequently violated in practice, namely that the RLHF-optimal policy must prefer the responses humans prefer. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences. The result is pathological convergence, in which a policy decreases the DPO loss while still preferring the dispreferred response. We characterize the conditions under which the assumption is violated, show that an undesirable solution space exists, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints to achieve provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity while achieving provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at https://github.com/visitworld123/CPO.
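
The failure mode named in the abstract can be illustrated with a small numerical sketch. It assumes the standard DPO loss, -log sigmoid(beta * margin), where the margin is the difference between the policy-to-reference log-ratios of the preferred response y_w and the dispreferred response y_l. The probabilities, the value beta = 0.1, and the function name dpo_loss are hypothetical choices made for illustration; they are not taken from the paper or its repository.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative helper)."""
    # Implicit reward margin: improvement on y_w minus improvement on y_l,
    # each measured relative to the reference policy.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # -log(sigmoid(beta * margin)) written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical reference policy that strongly favors the dispreferred response.
ref_w, ref_l = math.log(0.1), math.log(0.6)

# Hypothetical trained policy: y_w became more likely and y_l less likely,
# yet y_l is still the more probable response in absolute terms.
pol_w, pol_l = math.log(0.2), math.log(0.5)

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)        # margin is 0 here
loss_at_policy = dpo_loss(pol_w, pol_l, ref_w, ref_l)     # margin is positive
prefers_dispreferred = pol_l > pol_w

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss at policy:    {loss_at_policy:.4f}")
print(f"policy still prefers y_l: {prefers_dispreferred}")
```

In this toy setting the loss falls below its value at the reference policy even though the policy continues to rank y_l above y_w, because the loss only measures improvement relative to the reference.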