Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Abstract

Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences directly can be expensive, time-consuming, and high-variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, which are more consistent, cheaper, and more scalable than human annotations; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality while reducing the total cost of human annotation. The crux of our approach is to identify preference instances that will benefit from human annotation. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations, and we employ a routing strategy that selects the combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the hybrid mixture of LM and direct human preferences selected by our routing framework achieves better reward model performance than using either source exclusively. We also simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. Finally, we analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.
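The optimization described in the abstract can be illustrated with a toy sketch. This is not the released implementation: the tag names, the `simulated_score` function and its coefficients, the linear predictor, and the random-search router are all hypothetical stand-ins chosen for illustration. In the actual framework, the training targets for the performance predictor would come from training and evaluating a reward model on each candidate mixture, which the sketch replaces with a made-up scoring function.

```python
import random

TAGS = ["low", "moderate", "high"]  # hypothetical per-instance tags


def summarize(mask, tags):
    """Describe a routing by the fraction of all instances, per tag, sent to humans."""
    n = len(mask)
    fractions = {}
    for to_human, tag in zip(mask, tags):
        if to_human:
            fractions[tag] = fractions.get(tag, 0.0) + 1.0 / n
    return fractions


def simulated_score(mask, tags, rng):
    """Stand-in for 'train a reward model on this mixture and evaluate it'.
    The coefficients are invented for the sketch, not measured results."""
    f = summarize(mask, tags)
    return (0.60 + 0.30 * f.get("moderate", 0.0) + 0.05 * f.get("high", 0.0)
            - 0.05 * f.get("low", 0.0) + rng.gauss(0.0, 0.01))


def random_mask(n, max_human, rng):
    """Sample a routing (True = human, False = LM) with at most max_human humans."""
    k = rng.randint(0, max_human)
    chosen = set(rng.sample(range(n), k))
    return [i in chosen for i in range(n)]


def fit_predictor(examples, lr=0.1, epochs=300):
    """Fit a linear performance predictor with plain stochastic gradient descent."""
    w = {t: 0.0 for t in TAGS}
    bias = 0.0
    for _ in range(epochs):
        for feats, perf in examples:
            err = bias + sum(w[t] * feats.get(t, 0.0) for t in TAGS) - perf
            bias -= lr * err
            for t in TAGS:
                w[t] -= lr * err * feats.get(t, 0.0)
    return lambda feats: bias + sum(w[t] * feats.get(t, 0.0) for t in TAGS)


def route(tags, predictor, budget, rng, n_candidates=200):
    """Random search: keep the candidate routing with the highest predicted score."""
    best_mask, best_score = None, float("-inf")
    for _ in range(n_candidates):
        mask = random_mask(len(tags), budget, rng)
        score = predictor(summarize(mask, tags))
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score


rng = random.Random(0)
tags = [rng.choice(TAGS) for _ in range(30)]

# 1) Collect (routing, observed performance) pairs and fit the predictor.
train = []
for _ in range(100):
    mask = random_mask(len(tags), len(tags), rng)
    train.append((summarize(mask, tags), simulated_score(mask, tags, rng)))
predictor = fit_predictor(train)

# 2) Route: choose the human/LM assignment that maximizes predicted performance.
best_mask, best_score = route(tags, predictor, budget=10, rng=rng)
print(sum(best_mask), "instances routed to humans; predicted score:", best_score)
```

The sketch separates the two steps named in the abstract: fitting a predictor of downstream performance from observed mixtures, then searching over candidate mixtures under a human-annotation budget. Random search is used here only because it is short; any other search strategy over routings could be substituted.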
