Understanding Reference Policies in Direct Preference Optimization

July 18, 2024
Authors: Yixin Liu, Pengfei Liu, Arman Cohan
cs.AI

Abstract

Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO's effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL-divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of reference policies for instruction fine-tuning by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO's superiority. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
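
For context, the "strength of the KL-divergence constraint" discussed in the abstract corresponds to the β hyperparameter in the standard DPO objective of Rafailov et al. (2023). A brief recap of that objective in its usual form (notation may differ slightly from the paper itself):

```latex
% Standard DPO loss: beta scales the implicit KL penalty that keeps the
% fine-tuned policy pi_theta close to the reference policy pi_ref.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here σ is the logistic function and (x, y_w, y_l) is a prompt with its preferred and dispreferred responses; a larger β penalizes deviation from π_ref more strongly, which is the strength the paper's first research question varies.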
