New Desiderata for Direct Preference Optimization

July 12, 2024
Authors: Xiangkun Hu, Tong He, David Wipf
cs.AI

Abstract

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses.
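For context, the "single closed-form training objective" referenced above is, in the original DPO formulation (Rafailov et al., 2023), a logistic loss on the gap between policy-to-reference log-likelihood ratios for preferred versus dispreferred responses. The sketch below illustrates that standard baseline loss, not the alternative DPO-like loss proposed in this paper; the function name, argument layout, and beta value are illustrative assumptions.

    # Minimal sketch of the standard DPO objective (Rafailov et al., 2023),
    # the baseline that the methods discussed above build on. Names, shapes,
    # and the beta default are illustrative assumptions, not this paper's loss.
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """Standard DPO loss over a batch of preference pairs.

        Each argument is a 1-D tensor of summed log p(y | x) values, computed
        under the trainable policy or the frozen pre-trained reference model.
        """
        # Policy-to-reference log-ratios for the preferred (chosen) and
        # dispreferred (rejected) responses.
        chosen_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_ratio = policy_rejected_logps - ref_rejected_logps
        # Closed-form objective: -log sigmoid(beta * (chosen - rejected) margin).
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Example usage with dummy log-probabilities for four preference pairs.
    policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
    policy_rejected = torch.tensor([-14.0, -10.5, -15.9, -13.2])
    ref_chosen = torch.tensor([-12.9, -10.1, -15.4, -11.6])
    ref_rejected = torch.tensor([-13.5, -10.3, -15.6, -12.8])
    loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)

Here beta controls how strongly the policy is pulled away from the reference model, which is exactly the interpolation behavior the paper's new evaluation criteria scrutinize.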
