MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Abstract

Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial improvements over standard RLHF, with gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.
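To make the idea of a shorter temporal distance concrete, here is a minimal sketch, not the authors' implementation. It assumes macro actions are fixed-length groups of n consecutive tokens; the abstract does not say how macro actions are delimited, so this grouping rule, the helper name `group_into_macro_actions`, and the log-probability values are all illustrative assumptions. The one fact it relies on is the chain rule: the log-probability of a token sequence is the sum of its per-token log-probabilities.

```python
import math


def group_into_macro_actions(token_logprobs, n):
    """Merge every n consecutive token log-probs into one macro-action log-prob.

    Hypothetical helper: fixed-length grouping is an assumption, not the
    paper's stated rule. The last group may be shorter than n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [
        sum(token_logprobs[i:i + n])
        for i in range(0, len(token_logprobs), n)
    ]


# Made-up per-token log-probabilities for a 7-token response.
token_logprobs = [-0.5, -1.0, -0.25, -2.0, -0.75, -0.5, -1.5]

macro_logprobs = group_into_macro_actions(token_logprobs, n=3)

# A reward given at the end of the response is 7 decisions away from the
# first token-level action, but only ceil(7 / 3) = 3 decisions away from
# the first macro-level action.
print(len(token_logprobs), "token-level decisions")
print(len(macro_logprobs), "macro-level decisions")
```

The total log-probability of the response is unchanged by the grouping; only the number of decision points over which a delayed reward must be attributed shrinks.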
