
Understanding Reference Policies in Direct Preference Optimization

July 18, 2024
Authors: Yixin Liu, Pengfei Liu, Arman Cohan
cs.AI

Abstract

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO's effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL-divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of reference policies for instruction fine-tuning by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO's superiority. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
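For context on the first research question, the strength of the deviation penalty corresponds to the coefficient (commonly denoted β) in the standard DPO objective from Rafailov et al. (2023); the formulation below is reproduced from that original paper for reference, not taken from this abstract:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here π_ref is the reference policy, (y_w, y_l) are the preferred and dispreferred responses to prompt x, and a larger β imposes a stronger implicit KL-divergence constraint that keeps the fine-tuned policy π_θ close to π_ref, which is the sensitivity the abstract refers to.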
