The Conditional Equivalence of DPO and RLHF: Implicit Assumptions, Failure Modes, and Provable Alignment

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with a simpler implementation. We prove that this equivalence is conditional rather than universal: it depends on an implicit assumption that is frequently violated in practice, namely that the RLHF-optimal policy must prefer the human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences. The result is pathological convergence, in which a policy decreases the DPO loss while still preferring the dispreferred responses. We characterize when the assumption is violated, show that an undesirable solution space exists, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints to achieve provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity while offering provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at https://github.com/visitworld123/CPO.
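The failure mode described in the abstract can be illustrated with a small numerical sketch. It assumes the standard DPO loss for a single preference pair, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), and uses made-up probabilities and a made-up β = 0.1; none of these numbers come from the paper, and the snippet is not the CPO method itself.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    The loss depends only on the log-ratio margin relative to the
    reference policy, not on which response the policy ranks higher.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Hypothetical reference policy that strongly favors the dispreferred response.
ref_logp_w, ref_logp_l = math.log(0.1), math.log(0.9)

# Hypothetical trained policy: better than the reference on the preferred
# response, but still assigning more probability to the dispreferred one.
logp_w, logp_l = math.log(0.3), math.log(0.7)

# Loss when the policy equals the reference (margin is zero, loss is log 2).
loss_at_ref = dpo_loss(ref_logp_w, ref_logp_l, ref_logp_w, ref_logp_l)
# Loss for the trained policy.
loss_at_policy = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)

still_prefers_dispreferred = logp_l > logp_w

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss at policy:    {loss_at_policy:.4f}")
print(f"policy still prefers dispreferred response: {still_prefers_dispreferred}")
```

In this toy setting the loss falls below its value at the reference policy even though the policy continues to rank the dispreferred response higher, which is the gap between relative advantage and absolute alignment that the abstract points to.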