
Transforming and Combining Rewards for Aligning Large Language Models

February 1, 2024
Authors: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
cs.AI

Abstract

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
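As a rough illustration of the idea described in the abstract (not necessarily the paper's exact recipe), a Bradley-Terry reward can be read probabilistically by passing the reward, centered at some reference value, through a log-sigmoid; each transformed reward is then a log-probability of being "good" on one property, so summing transformed rewards corresponds to multiplying those probabilities, which is where the conjunction interpretation comes from. The minimal sketch below assumes that reading; the function names, reference points, and numeric values are invented for illustration.

```python
import math


def transform_reward(r: float, r_ref: float) -> float:
    """Log-sigmoid of the centered reward.

    Under a Bradley-Terry preference model, sigmoid(r - r_ref) can be read as
    the probability that the output beats a reference output. The log-sigmoid
    saturates near 0 for outputs that already score well, so optimization
    pressure concentrates on poorly-performing outputs.
    """
    # log(sigmoid(x)) written as -log(1 + exp(-x))
    return -math.log1p(math.exp(-(r - r_ref)))


def combine_rewards(rewards, reference_rewards):
    """Sum the transformed per-property rewards.

    Treating the per-property "good" events as independent, this sum is the
    log-probability that the output is good on every measured property.
    """
    return sum(transform_reward(r, r0) for r, r0 in zip(rewards, reference_rewards))


# Toy values for two properties (e.g. helpfulness and harmlessness);
# the rewards and reference points here are made up for illustration.
print(combine_rewards(rewards=[2.1, -0.3], reference_rewards=[0.5, 0.0]))
```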