

New Desiderata for Direct Preference Optimization

July 12, 2024
Authors: Xiangkun Hu, Tong He, David Wipf
cs.AI

Abstract

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses.
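For context, the single closed-form training objective mentioned above is the original DPO loss of Rafailov et al.; the sketch below is a minimal, hedged illustration of that standard objective (not the alternative DPO-like loss proposed in this paper), assuming per-sequence log-probabilities for the preferred (chosen) and dispreferred (rejected) responses have already been computed under both the policy being fine-tuned and the frozen pre-trained reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: negative log-sigmoid of the scaled gap
    between the policy-vs-reference log-ratios of the chosen and
    rejected responses. beta controls how strongly the policy is
    regularized toward the reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

The paper's analysis concerns how losses of this general form interpolate between the reference model and empirical preference data; its proposed alternative modifies the objective and is detailed in the full text.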
