Conditional Equivalence of DPO and RLHF: Implicit Assumptions, Failure Modes, and Provable Alignment

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with a simpler implementation. We prove that this equivalence is conditional rather than universal: it depends on an implicit assumption that is frequently violated in practice, namely that the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences. The result is pathological convergence, in which a policy decreases the DPO loss while still preferring dispreferred responses. We characterize when the assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), which augments RLHF with constraints to achieve provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions that preserve simplicity while achieving provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
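
The failure mode described above can be made concrete with a small numerical sketch. It uses the standard DPO loss for a single preference pair; the probabilities, the value of beta, and all variable names are illustrative assumptions chosen for this example, not values or code from the paper or its repository. The reference policy is assumed to favor the dispreferred response strongly, and the trained policy narrows that gap without closing it.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l are the policy log-probabilities of the preferred and
    dispreferred responses; ref_logp_* are the reference policy's values.
    """
    # Implicit reward margin: improvement over the reference on the
    # preferred response minus improvement on the dispreferred one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Assumed toy values: the reference policy strongly favors the
# dispreferred response (0.9 vs 0.1).
ref_w, ref_l = math.log(0.1), math.log(0.9)

# Assumed trained policy: better than the reference in relative terms,
# but it still assigns more probability to the dispreferred response.
pol_w, pol_l = math.log(0.3), math.log(0.7)

loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l)      # policy == reference
loss_at_policy = dpo_loss(pol_w, pol_l, ref_w, ref_l)   # after "training"
still_prefers_dispreferred = pol_l > pol_w

print(f"loss at reference: {loss_at_ref:.4f}")
print(f"loss at policy:    {loss_at_policy:.4f}")
print(f"policy still prefers the dispreferred response: {still_prefers_dispreferred}")
```

In this toy setting the loss falls below its value at the reference policy even though the policy still ranks the dispreferred response higher. The margin inside the loss is positive whenever log pi(y_w) - log pi(y_l) exceeds log pi_ref(y_w) - log pi_ref(y_l), and that threshold is negative when the reference favors the dispreferred response, which is one way to read the "potentially negative targets" mentioned in the abstract. The sketch does not implement CPO, whose constraint is not specified in this section.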