Axiomatic Preference Modeling for Longform Question Answering

Abstract

The remarkable abilities of large language models (LLMs) like GPT-4 stem in part from post-training processes such as Reinforcement Learning from Human Feedback (RLHF), in which human preferences are encoded in a reward model. However, these reward models (RMs) often lack direct knowledge of why, or under what principles, the preference annotations were made. In this study, we identify principles that guide RMs to better align with human preferences, and then develop an axiomatic framework to generate a rich variety of preference signals that uphold them. We use these axiomatic signals to train a model for scoring answers to longform questions. Our approach yields a Preference Model with only about 220M parameters that agrees with gold human-annotated preference labels more often than GPT-4. The contributions of this work include: training a standalone preference model that can score human- and LLM-generated answers on the same scale; developing an axiomatic framework for generating training data pairs tailored to certain principles; and showing that a small amount of axiomatic signals can help small models outperform GPT-4 in preference scoring. We release our model on Hugging Face: https://huggingface.co/corbyrosset/axiomatic_preference_model
