Learning Explainable Dense Reward Shapes via Bayesian Optimization
April 22, 2025
Authors: Ryan Koo, Ian Yang, Vipul Raheja, Mingyi Hong, Kwang-Sung Jun, Dongyeop Kang
cs.AI
Abstract
Current reinforcement learning from human feedback (RLHF) pipelines for large
language model (LLM) alignment typically assign scalar rewards to sequences,
using the final token as a surrogate indicator for the quality of the entire
sequence. However, this leads to sparse feedback and suboptimal token-level
credit assignment. In this work, we frame reward shaping as an optimization
problem focused on token-level credit assignment. We propose a reward-shaping
function leveraging explainability methods such as SHAP and LIME to estimate
per-token rewards from the reward model. To learn parameters of this shaping
function, we employ a bilevel optimization framework that integrates Bayesian
Optimization and policy training to handle noise from the token reward
estimates. Our experiments show that achieving a better balance of token-level
reward attribution leads to performance improvements over baselines on
downstream tasks and yields an optimal policy faster during training.
Furthermore, we show theoretically that explainability methods that are
feature-additive attribution functions maintain the same optimal policy as the
original reward.
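
To make the per-token attribution idea concrete, here is a minimal, self-contained sketch of a feature-additive attribution over token positions, using a Monte-Carlo Shapley estimator as a generic stand-in for SHAP/LIME. The `reward_model` callable, the `mask_id` placeholder, and the toy example are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: Monte-Carlo Shapley attribution of a sequence-level
# reward over token positions, used as a dense per-token reward signal.
import random
from typing import Callable, List

def shapley_token_rewards(
    tokens: List[int],
    reward_model: Callable[[List[int]], float],
    mask_id: int,
    n_samples: int = 64,
) -> List[float]:
    """Monte-Carlo Shapley estimate of each token's contribution to the reward."""
    T = len(tokens)
    contrib = [0.0] * T
    for _ in range(n_samples):
        order = random.sample(range(T), T)   # random ordering of positions
        masked = [mask_id] * T               # start from a fully masked sequence
        prev = reward_model(masked)
        for pos in order:
            masked[pos] = tokens[pos]        # reveal this token
            cur = reward_model(masked)
            contrib[pos] += cur - prev       # marginal contribution of the token
            prev = cur
    return [c / n_samples for c in contrib]

if __name__ == "__main__":
    MASK_ID = 0
    # Toy stand-in for a sequence-level reward model (illustrative only).
    def reward_model(seq: List[int]) -> float:
        return float(seq.count(7)) + (0.5 if 3 in seq else 0.0)

    tokens = [7, 3, 5, 7]
    per_token = shapley_token_rewards(tokens, reward_model, MASK_ID)
    print(per_token)
    # The attributions sum exactly to R(tokens) - R(all masked) for this estimator.
    print(sum(per_token), reward_model(tokens) - reward_model([MASK_ID] * len(tokens)))
```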
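The bilevel structure described in the abstract can likewise be sketched as an outer search over shaping-function parameters wrapped around an inner policy-training step. All helpers below (`shaped_rewards`, `propose_candidate`, `train_policy_under_reward`, `evaluate_policy`) are hypothetical placeholders; in particular, `propose_candidate` stands in for a Bayesian-optimization proposal (e.g., a Gaussian-process surrogate with an acquisition function), which the sketch replaces with random search for brevity.

```python
# Hypothetical sketch of the bilevel loop: the outer level searches over the
# parameters of a reward-shaping function, the inner level trains a policy
# under the shaped (dense) reward.  All helpers below are placeholders.
import random
from typing import Dict, List

def shaped_rewards(per_token: List[float], seq_reward: float, w: Dict[str, float]) -> List[float]:
    """Mix dense per-token attributions with the sparse sequence-level reward.

    A simple parameterization: each token gets w["dense"] * phi_t, and the
    final token additionally receives w["sparse"] * R(x, y).
    """
    out = [w["dense"] * phi for phi in per_token]
    out[-1] += w["sparse"] * seq_reward
    return out

def propose_candidate() -> Dict[str, float]:
    # Placeholder for a Bayesian-optimization proposal (GP surrogate + acquisition).
    return {"dense": random.uniform(0.0, 1.0), "sparse": random.uniform(0.0, 1.0)}

def train_policy_under_reward(w: Dict[str, float]) -> object:
    # Placeholder for the inner loop: e.g., a few policy-gradient epochs in which
    # each rollout is scored with shaped_rewards(...).
    return {"weights": w}

def evaluate_policy(policy: object) -> float:
    # Placeholder for the outer objective: e.g., validation win rate.
    return random.random()

def bilevel_search(n_outer: int = 10) -> Dict[str, float]:
    best_w, best_score = None, float("-inf")
    for _ in range(n_outer):
        w = propose_candidate()                 # outer level: pick shaping parameters
        policy = train_policy_under_reward(w)   # inner level: train under shaped reward
        score = evaluate_policy(policy)         # feed the result back to the outer search
        if score > best_score:
            best_w, best_score = w, score
    return best_w

if __name__ == "__main__":
    w = bilevel_search()
    print("selected shaping parameters:", w)
    print("example shaped rewards:", shaped_rewards([0.1, -0.2, 0.4], seq_reward=1.0, w=w))
```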
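Finally, a short reconstruction (under assumed notation, not necessarily the paper's exact statement) of why a feature-additive attribution can preserve the optimal policy: if the per-token attributions sum to the sequence reward up to a base value that depends only on the prompt, the dense return differs from the original reward by a per-prompt constant, so the maximizing policy is unchanged.

```latex
% Assumed notation: R(x, y) is the sequence-level reward model, phi_t(x, y)
% the per-token attributions, phi_0(x) the explainer's base value.
\begin{align}
  R(x, y) &= \phi_0(x) + \sum_{t=1}^{T} \phi_t(x, y)
      && \text{(feature additivity)} \\
  \sum_{t=1}^{T} r_t &= \sum_{t=1}^{T} \phi_t(x, y) = R(x, y) - \phi_0(x)
      && \text{(dense return, with } r_t := \phi_t\text{)} \\
  \arg\max_{\pi}\, \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\Big[\textstyle\sum_{t} r_t\Big]
      &= \arg\max_{\pi}\, \mathbb{E}_{y \sim \pi(\cdot\mid x)}\big[R(x, y)\big]
      && (\phi_0(x)\ \text{is policy-independent})
\end{align}
```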