

DPO-Shift: Shifting the Distribution of Direct Preference Optimization

February 11, 2025
Authors: Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
cs.AI

Abstract

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
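
To make the idea of "shifting the chosen probability" concrete, below is a minimal PyTorch-style sketch of a DPO-type pairwise loss with an illustrative scaling factor lam applied to the rejected-response log-ratio. The function name, argument names, and this particular parameterization are assumptions made for illustration only; they are not claimed to be the paper's exact formulation (see the linked repository for the authors' implementation).

import torch
import torch.nn.functional as F

def shifted_dpo_loss(policy_chosen_logps: torch.Tensor,
                     policy_rejected_logps: torch.Tensor,
                     ref_chosen_logps: torch.Tensor,
                     ref_rejected_logps: torch.Tensor,
                     beta: float = 0.1,
                     lam: float = 1.0) -> torch.Tensor:
    """Sketch of a DPO-style pairwise loss with a shift parameter.

    lam = 1.0 recovers the standard DPO objective; lam < 1.0
    down-weights the rejected-response log-ratio (a hypothetical
    parameterization used here only to illustrate how a controllable
    shift could enter the objective).
    """
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry style margin; the rejected term is scaled by lam.
    logits = beta * (chosen_logratio - lam * rejected_logratio)
    # Negative log-sigmoid of the margin, averaged over the batch.
    return -F.logsigmoid(logits).mean()

With lam = 1.0 this reduces to the usual DPO loss on summed sequence log-probabilities; values below 1.0 weaken the pressure to push the rejected response down, which is one plausible way to keep the chosen probability from collapsing at the cost of a smaller reward margin, mirroring the trade-off described in the abstract.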
