Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. To mitigate reward hacking, we introduce a hybrid reward system that combines reasoning task verifiers (RTV) with a generative reward model (GenRM). To maintain response diversity and enhance learning effectiveness, we propose a novel prompt-selection method, Pre-PPO. We also find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate the effectiveness and scalability of our methods. The results show that RTV is the most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
