Transforming and Combining Rewards for Aligning Large Language Models

Abstract

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice of transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
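
As a rough illustration of the two properties described above, the sketch below assumes the transformation is a log-sigmoid of the reward centered at a per-property reference value. The abstract does not state the formula, so this form, the function names, and the numbers are assumptions made only for illustration. Under a Bradley-Terry model, sigmoid(r - r_ref) can be read as the probability that an output is preferred to a reference output, so summing the log-sigmoid terms gives the log of the product of those probabilities, which is the log-probability of being "good" on every property if the properties are treated as independent.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, used here as the Bradley-Terry preference probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def transform(reward: float, reference: float) -> float:
    """Assumed transformation: log-sigmoid of the reward centered at a reference value."""
    return log_sigmoid(reward - reference)


def combine(rewards: list[float], references: list[float]) -> float:
    """Sum of transformed rewards, one term per measured property."""
    return sum(transform(r, ref) for r, ref in zip(rewards, references))


# Hypothetical raw rewards for two properties (e.g. helpfulness, harmlessness).
rewards = [2.0, -1.0]
references = [0.0, 0.0]

combined = combine(rewards, references)
# The sum equals the log of the product of per-property "good" probabilities.
joint_log_prob = math.log(sigmoid(2.0) * sigmoid(-1.0))

# Diminishing returns: one extra unit of reward matters far more for a
# poorly scoring output than for one that already scores well.
gain_when_low = transform(-9.0, 0.0) - transform(-10.0, 0.0)
gain_when_high = transform(10.0, 0.0) - transform(9.0, 0.0)
```

In this sketch the transformed reward saturates near zero for high raw rewards, so pushing an already good output further yields little gain, while a weak score on any single property drags the sum down sharply.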
