Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Abstract

Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF), in which human preference judgments on LM outputs are transformed into a learning signal, has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs: it does not indicate which aspects of the outputs influenced user preference (e.g., which parts contain which types of errors). In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and code at https://FineGrainedRLHF.github.io.
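
The two senses of "fine-grained" above can be illustrated with a minimal sketch: each segment receives a reward at its final token (density), and that reward combines scores from several reward models (multiple feedback types). The function name `dense_rewards`, the use of one shared segmentation for all reward models, the weighted-sum combination, and every score and weight below are assumptions made for illustration; this is not the released implementation.

```python
from typing import Dict, List


def dense_rewards(
    num_tokens: int,
    segment_ends: List[int],
    segment_scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Return one reward per token.

    Each segment's reward is placed on the segment's last token and is the
    weighted sum of the scores that the (hypothetical) reward models gave
    to that segment. All other tokens receive zero reward.
    """
    rewards = [0.0] * num_tokens
    for name, scores in segment_scores.items():
        if len(scores) != len(segment_ends):
            raise ValueError(f"reward model {name!r} must score every segment")
        for end, score in zip(segment_ends, scores):
            rewards[end] += weights[name] * score
    return rewards


# Illustrative input: a 6-token output split into two segments that end at
# token indices 2 and 5, scored by two hypothetical reward models.
rewards = dense_rewards(
    num_tokens=6,
    segment_ends=[2, 5],
    segment_scores={"factuality": [1.0, -1.0], "relevance": [0.5, 0.5]},
    weights={"factuality": 1.0, "relevance": 0.5},
)
print(rewards)
```

Changing the entries of `weights` shifts the relative emphasis on each feedback type, which is one simple way to picture the customization of LM behavior mentioned in the abstract.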

