New Desiderata for Direct Preference Optimization
July 12, 2024
Authors: Xiangkun Hu, Tong He, David Wipf
cs.AI
Abstract
Large language models in the past have typically relied on some form of
reinforcement learning with human feedback (RLHF) to better align model
responses with human preferences. However, because of oft-observed
instabilities when implementing these RLHF pipelines, various
reparameterization techniques have recently been introduced to sidestep the
need for separately learning an RL reward model. Instead, directly fine-tuning
for human preferences is achieved via the minimization of a single closed-form
training objective, a process originally referred to as direct preference
optimization (DPO) and followed by several notable descendants. Although
effective in certain real-world settings, we introduce new evaluation criteria
that serve to highlight unresolved shortcomings in the ability of existing DPO
methods to interpolate between a pre-trained reference model and empirical
measures of human preferences, as well as unavoidable trade-offs in how low-
and high-quality responses are regularized and constraints are handled. Our
insights then motivate an alternative DPO-like loss that provably mitigates
these limitations. Empirical results serve to corroborate notable aspects of
our analyses.
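For context, the "single closed-form training objective" the abstract refers to is the standard DPO loss of Rafailov et al. (2023). Below is a minimal PyTorch-style sketch of that loss; the function and argument names are illustrative, and this is not the alternative DPO-like loss proposed in the paper, whose exact form the abstract does not specify.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Each input holds the summed log-probability log pi(y|x) of a response,
    # one entry per preference pair in the batch.
    # Implicit rewards: beta-scaled log-ratios of the trainable policy
    # against the frozen pre-trained reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO minimizes the negative log-sigmoid of the reward margin between
    # the human-preferred and dispreferred responses, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Here beta controls how strongly the policy is tethered to the reference model; the paper's new evaluation criteria concern how well such losses interpolate between that reference model and the empirical preference data.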