Learning Explainable Dense Reward Shapes via Bayesian Optimization

Abstract

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign a single scalar reward to each sequence, using the final token as a surrogate for the quality of the entire sequence. However, this makes feedback sparse and leads to suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian optimization with policy training and handles the noise in the token reward estimates. Our experiments show that a better balance of token-level reward attribution improves performance over baselines on downstream tasks and that an optimal policy is found faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
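To make the feature-additive property concrete, the following is a minimal sketch and not the authors' implementation. It computes exact Shapley values for a toy reward function and checks that the per-token attributions sum to the sequence-level reward. The names `toy_reward`, `shapley_attributions`, and `shaped_rewards`, and the mixing weight `w`, are hypothetical. The convex mix of dense and sparse rewards is only an assumed form of a shaping function; in the framework described above, such a weight would be the kind of parameter tuned by Bayesian optimization.

```python
from itertools import permutations


def toy_reward(tokens):
    """Hypothetical stand-in for a reward model scoring a tuple of tokens."""
    score = 0.0
    if "thank" in tokens:
        score += 1.0
    if "you" in tokens:
        score += 0.5
    if "thank" in tokens and "you" in tokens:
        score += 0.5  # interaction between the two tokens
    return score


def shapley_attributions(tokens, reward_fn):
    """Exact Shapley value per token, averaging marginal gains over all orderings.

    Only feasible for very short sequences (n! orderings).
    """
    n = len(tokens)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        present = set()
        prev = reward_fn(tuple())
        for i in order:
            present.add(i)
            subset = tuple(tokens[j] for j in sorted(present))
            cur = reward_fn(subset)
            phi[i] += cur - prev  # marginal contribution of token i
            prev = cur
    return [p / len(orderings) for p in phi]


def shaped_rewards(tokens, reward_fn, w):
    """Assumed shaping form: mix dense attributions with the sparse final-token reward."""
    phi = shapley_attributions(tokens, reward_fn)
    sparse = [0.0] * (len(tokens) - 1) + [reward_fn(tuple(tokens))]
    return [w * p + (1.0 - w) * s for p, s in zip(phi, sparse)]


tokens = ["thank", "you", "now"]
phi = shapley_attributions(tokens, toy_reward)
dense = shaped_rewards(tokens, toy_reward, w=0.5)
print(phi, sum(phi), toy_reward(tuple(tokens)))
print(dense, sum(dense))
```

Because the toy reward of the empty sequence is zero, both the pure attributions and any convex mix with the sparse reward sum to the original sequence reward. The total return is therefore unchanged while the credit is spread across tokens.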


