A Long Way to Go: Investigating Length Correlations in RLHF

Abstract

Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they are not uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
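As a rough illustration of the two quantities the abstract refers to, the sketch below computes a Pearson correlation between output lengths and reward scores, and defines a toy length-only reward. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the whitespace tokenization, and the `max_tokens` cap are all hypothetical choices made for illustration, and the abstract does not specify the exact form of the length-based reward.

```python
import math
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def length_only_reward(response: str, max_tokens: int = 256) -> float:
    """Hypothetical length-only reward (an assumption, not the paper's formula).

    Counts whitespace-separated tokens, caps the count at max_tokens,
    and scales the result to the range [0, 1].
    """
    n_tokens = len(response.split())
    return min(n_tokens, max_tokens) / max_tokens
```

Given a list of sampled outputs and the scores a reward model assigns to them, one could pass the token counts and scores to `pearson_correlation` to check how much of the reward tracks length. The cap in `length_only_reward` is a design choice for this sketch, so that the toy reward cannot grow without bound as outputs get longer.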
