DPO-Shift: Shifting the Distribution of Direct Preference Optimization

Abstract

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (preferred) and rejected (dispreferred) responses. However, prior research has found that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To address this, we introduce DPO-Shift, which controllably shifts the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate that DPO-Shift outperforms DPO on downstream tasks such as MT-Bench and a designed win-rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
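
To make the quantities in the abstract concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, together with the two values the trade-off is stated in terms of: the chosen log-probability and the reward margin. It assumes the usual DPO formulation with implicit rewards beta * (log pi - log pi_ref). The function name `dpo_stats`, the value beta = 0.1, and the toy log-probabilities are illustrative assumptions, not taken from the paper or its repository, and the sketch does not implement DPO-Shift itself.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_stats(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities log p(y | x) under the
    policy (pol_*) and the frozen reference model (ref_*).
    """
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return {
        "loss": -log_sigmoid(margin),
        "reward_margin": margin,
        "chosen_logp": pol_chosen,
    }


# Hypothetical numbers: at the start of training the policy equals the reference.
before = dpo_stats(-10.0, -12.0, -10.0, -12.0)
# Later, both log-probabilities have dropped, the rejected one much more.
after = dpo_stats(-11.0, -20.0, -10.0, -12.0)

for name, stats in (("before", before), ("after", after)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

In this toy case the loss falls and the reward margin grows even though the chosen log-probability decreases, which is the pattern the abstract calls likelihood displacement: the objective depends only on the margin, so it can be improved by pushing the rejected response down faster than the chosen one.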

