General Preference Modeling with Preference Representations for Aligning Language Models

Abstract

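Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly when preferences are intransitive. Supervised pair preference models (PairPM) can express general preferences, but their implementation is highly ad hoc and cannot guarantee a consistent preference probability for the compared pairs. They also incur high computational costs because comparing multiple responses requires a quadratic number of queries. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. We further propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark by a margin of up to 5.6% and effectively models cyclic preferences, where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following language model post-training with GPO and our general preference model, show substantial performance improvements with margins of up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.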
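The contrast between scalar rewards and embedded preferences can be illustrated with a toy example. The sketch below is not the authors' implementation: the 2-D embeddings, the skew-symmetric score `pref_score`, and the three responses A, B, C are all assumptions chosen for illustration. It shows that a score of this form can represent the cycle A > B > C > A with consistent probabilities, whereas a BT model with scalar rewards cannot make all three preferences exceed 0.5. Each response is embedded once, so scoring additional pairs reuses the same vectors.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bt_prob(r_i: float, r_j: float) -> float:
    """Bradley-Terry probability that response i beats response j."""
    return sigmoid(r_i - r_j)


def pref_score(v_i, v_j) -> float:
    """Illustrative skew-symmetric score on 2-D embeddings (an assumption).

    Equals v_i^T R v_j with R = [[0, 1], [-1, 0]], so swapping the
    arguments flips the sign: pref_score(a, b) == -pref_score(b, a).
    """
    return v_i[0] * v_j[1] - v_i[1] * v_j[0]


def pref_prob(v_i, v_j) -> float:
    """Probability that response i is preferred over response j."""
    return sigmoid(pref_score(v_i, v_j))


def unit(angle_deg: float):
    """Point on the unit circle at the given angle."""
    t = math.radians(angle_deg)
    return (math.cos(t), math.sin(t))


# Hypothetical embeddings: three responses spaced 120 degrees apart.
EMBEDDINGS = {"A": unit(0), "B": unit(120), "C": unit(240)}
CYCLE = [("A", "B"), ("B", "C"), ("C", "A")]

# Embedding-based model: every link of the cycle is preferred (> 0.5).
cyclic_probs = [pref_prob(EMBEDDINGS[i], EMBEDDINGS[j]) for i, j in CYCLE]

# BT model with example scalar rewards: r_A > r_B > r_C forces
# P(C beats A) below 0.5, so the cycle cannot be expressed.
rewards = {"A": 2.0, "B": 1.0, "C": 0.0}
bt_probs = [bt_prob(rewards[i], rewards[j]) for i, j in CYCLE]

print("embedding model:", cyclic_probs)
print("BT model:       ", bt_probs)
```

Because the score is antisymmetric, P(i > j) + P(j > i) = 1 holds by construction, which is the kind of pairwise consistency a general preference model needs. The BT failure is not specific to the rewards chosen here: any scalar rewards with r_A > r_B and r_B > r_C imply r_A > r_C.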

