General Preference Modeling with Preference Representations for Aligning Language Models

October 3, 2024
Authors: Yifan Zhang, Ge Zhang, Yue Wu, Kangping Xu, Quanquan Gu
cs.AI

Abstract

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
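To make the abstract's central contrast concrete, here is a minimal toy sketch, not the authors' released implementation: it assumes 2-D response embeddings, a skew-symmetric operator R, and a sigmoid link, and shows how a preference score of the form s(i, j) = v_i^T R v_j is antisymmetric (s(i, j) = -s(j, i)), needs only one embedding per response (hence linear query complexity across candidates), and can still represent a cyclic preference A > B > C > A that no scalar Bradley-Terry reward can reproduce.

```python
# Illustrative sketch only: the 2-D latent space, the operator R, and the
# sigmoid link are assumptions chosen to demonstrate the idea, not the paper's code.
import numpy as np

def gpm_prob(v_i, v_j, R):
    """Preference-representation probability P(i beats j) from s = v_i^T R v_j.
    Because R is skew-symmetric, s(i, j) = -s(j, i), so the pairwise
    probabilities of any compared pair are automatically consistent."""
    s = v_i @ R @ v_j
    return 1.0 / (1.0 + np.exp(-s))

# Skew-symmetric operator in a 2-D latent space (a 90-degree rotation).
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

# Embed three responses at 120-degree intervals on the unit circle.
angles = {"A": 0.0, "B": -2 * np.pi / 3, "C": -4 * np.pi / 3}
v = {k: np.array([np.cos(a), np.sin(a)]) for k, a in angles.items()}

for i, j in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"P({i} > {j}) = {gpm_prob(v[i], v[j], R):.3f}")
# All three probabilities exceed 0.5, i.e. A > B > C > A.
```

Running the sketch prints a probability above 0.5 for each edge of the cycle; a Bradley-Terry model, which orders responses by a single scalar reward, can satisfy at most two of the three comparisons, which is why it degrades to chance on cyclic preferences.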
