Transforming and Combining Rewards for Aligning Large Language Models

February 1, 2024
Authors: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
cs.AI

Abstract

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice of transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
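As a rough illustration of the probabilistic interpretation described in the abstract: under a Bradley-Terry model, the sigmoid of a (suitably centered) reward can be read as the probability that an output is "good" on a given property, so a log-sigmoid transformation turns a sum of rewards into the log-probability of a conjunction. The sketch below assumes this log-sigmoid form; the specific reward values, reference scores, and the independence of properties are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

# Hypothetical raw Bradley-Terry rewards for one output on two properties
# (e.g. helpfulness and harmlessness), plus hypothetical reference scores
# defining what counts as "good enough" on each property.
raw_rewards = np.array([1.2, -0.3])
reference_rewards = np.array([0.5, 0.0])

# Transformed reward per property: log of the Bradley-Terry probability
# that the output beats the reference on that property.
transformed = log_sigmoid(raw_rewards - reference_rewards)

# Summing transformed rewards adds log-probabilities, so (assuming the
# properties are independent) the sum is the log-probability that the
# output is "good" on all properties simultaneously.
combined = transformed.sum()

print("per-property P(good):", np.exp(transformed))
print("P(good on all):      ", np.exp(combined))
```

Note also how this transformation saturates for already-high rewards: once sigmoid(r - r_ref) is close to 1, further increases in r add little, which is one way to see why it de-emphasizes outputs that already score well.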