LiPO: Listwise Preference Optimization through Learning-to-Rank
February 2, 2024
Authors: Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, Xuanhui Wang
cs.AI
Abstract
Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been no direct study of fitting upon a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we provide an examination of ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.
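
To make the LTR connection concrete, below is a minimal sketch of a lambda-weighted listwise preference loss in the spirit described above: pairwise logistic losses over DPO-style implicit rewards, reweighted by LambdaLoss-style pair weights. The function name, the label encoding, and the exact weighting scheme are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def lipo_lambda_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Lambda-weighted listwise preference loss over one ranked response list.

    policy_logps, ref_logps: (K,) summed log-probs of K responses under the
        policy and a frozen reference model.
    labels: (K,) graded relevance per response (higher = more preferred),
        e.g. reward-model or human rating scores.
    beta: temperature on the implicit reward, as in DPO-style objectives.
    (Illustrative sketch; details are assumptions, not the paper's exact loss.)
    """
    # Implicit reward s_i = beta * log(pi(y_i|x) / pi_ref(y_i|x)).
    s = beta * (policy_logps - ref_logps)

    # Rank positions (1 = best) induced by the current policy scores.
    ranks = torch.empty_like(s, dtype=torch.long)
    ranks[torch.argsort(s, descending=True)] = torch.arange(1, s.numel() + 1)

    # LambdaLoss-style pair weight: |gain_i - gain_j| * |1/D(rank_i) - 1/D(rank_j)|.
    gains = 2.0 ** labels - 1.0
    discounts = torch.log2(1.0 + ranks.float())
    delta = (gains[:, None] - gains[None, :]).abs() * \
            (1.0 / discounts[:, None] - 1.0 / discounts[None, :]).abs()

    # Pairwise logistic loss, counted only where i is truly preferred to j.
    prefer = (labels[:, None] > labels[None, :]).float()
    pair_loss = -F.logsigmoid(s[:, None] - s[None, :])
    return (prefer * delta * pair_loss).sum() / prefer.sum().clamp(min=1.0)
```

With K = 2 and uniform pair weights, the sum reduces to a single logistic term on the score difference of the preferred and dispreferred response, which recovers a DPO-style pairwise objective as a special case of the listwise view.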