LiPO: Listwise Preference Optimization through Learning-to-Rank
February 2, 2024
Authors: Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
cs.AI
Abstract
Aligning language models (LMs) with curated human feedback is critical to
control their behaviors in real-world applications. Several recent policy
optimization methods, such as DPO and SLiC, serve as promising alternatives to
the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In
practice, human feedback often comes in the format of a ranked list over multiple
responses to amortize the cost of reading the prompt. Multiple responses can also
be ranked by reward models or AI feedback. However, there has been little study of
directly fitting to a list of responses. In this work, we formulate LM alignment
as a listwise ranking problem and describe the Listwise Preference Optimization
(LiPO) framework, where the policy can potentially learn more effectively from
a ranked list of plausible responses given the prompt. This view draws an
explicit connection to Learning-to-Rank (LTR), where most existing preference
optimization work can be mapped to existing ranking objectives, especially
pairwise ones. Following this connection, we provide an examination of ranking
objectives that are not well studied for LM alignment, with DPO and SLiC as
special cases when the list size is two. In particular, we highlight a specific
method, LiPO-λ, which leverages a state-of-the-art listwise ranking
objective and weights each preference pair in a more advanced manner. We show
that LiPO-λ can outperform DPO and SLiC by a clear margin on two
preference alignment tasks.
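
To make the Learning-to-Rank connection concrete, the following is a minimal sketch of a LiPO-λ-style listwise loss, not the authors' implementation. It assumes DPO-style implicit rewards β·log(π/π_ref), graded labels for each response (e.g., reward-model scores), and LambdaLoss-style pair weights built from an assumed exponential gain and log-rank discount; the function name `lipo_lambda_loss` and the default `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Listwise preference loss over K ranked responses to one prompt (a sketch).

    policy_logps, ref_logps: (K,) summed log-probabilities of each response
        under the policy and the frozen reference model.
    labels: (K,) graded quality of each response (e.g. reward-model scores);
        higher means better.
    """
    # DPO-style implicit reward of each response: beta * log(pi / pi_ref).
    rewards = beta * (policy_logps - ref_logps)                      # (K,)

    # All pairwise differences of implicit rewards and of labels.
    reward_diff = rewards.unsqueeze(1) - rewards.unsqueeze(0)        # (K, K)
    label_diff = labels.unsqueeze(1) - labels.unsqueeze(0)           # (K, K)
    pair_mask = (label_diff > 0).float()  # keep pairs where i is preferred to j

    # LambdaLoss-style pair weights (an assumption here): gain difference times
    # rank-discount difference, with ranks induced by the current rewards.
    gains = torch.pow(2.0, labels) - 1.0                             # (K,)
    ranks = torch.argsort(torch.argsort(rewards, descending=True)) + 1
    inv_discounts = 1.0 / torch.log2(ranks.float() + 1.0)            # (K,)
    delta = torch.abs(
        (gains.unsqueeze(1) - gains.unsqueeze(0))
        * (inv_discounts.unsqueeze(1) - inv_discounts.unsqueeze(0))
    )                                                                # (K, K)

    # Weighted pairwise logistic loss, averaged over all preferred pairs.
    pair_loss = -F.logsigmoid(reward_diff) * delta * pair_mask
    return pair_loss.sum() / pair_mask.sum().clamp(min=1.0)
```

With a list of size two and the pair weight fixed to one, the weighted term reduces to the pairwise logistic loss used by DPO, matching the special case noted in the abstract.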