

Learning Explainable Dense Reward Shapes via Bayesian Optimization

April 22, 2025
作者: Ryan Koo, Ian Yang, Vipul Raheja, Mingyi Hong, Kwang-Sung Jun, Dongyeop Kang
cs.AI

Abstract

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and to finding an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
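To make the core idea concrete, the snippet below is a minimal sketch (not taken from the paper) of how feature-additive, per-token attributions could be redistributed into a dense reward that sums to the original scalar reward, which is what preserves the optimal policy. The function name, the mixing parameter alpha (the shaping parameter one might tune with Bayesian optimization in the outer loop), and the toy attribution values are illustrative assumptions; the attributions themselves are assumed to come from an explainer such as SHAP or LIME applied to the reward model.

```python
import numpy as np

def shape_token_rewards(attributions, sequence_reward, alpha):
    """Hypothetical sketch: redistribute a scalar sequence reward over tokens.

    attributions: per-token additive attribution estimates (e.g. SHAP or LIME).
    alpha: mixing weight in [0, 1]; alpha = 0 recovers the sparse setup
        (all reward on the final token), alpha = 1 uses attributions alone.
    The returned dense rewards sum to sequence_reward by construction, so the
    return of every trajectory is unchanged relative to the scalar reward.
    """
    attributions = np.asarray(attributions, dtype=float)
    T = len(attributions)

    # Sparse baseline: the entire reward is placed on the last token.
    sparse = np.zeros(T)
    sparse[-1] = sequence_reward

    # Rescale attributions so they redistribute exactly sequence_reward.
    total = attributions.sum()
    if abs(total) > 1e-8:
        dense = attributions * (sequence_reward / total)
    else:
        dense = sparse  # degenerate case: fall back to the sparse reward

    return alpha * dense + (1.0 - alpha) * sparse

# Toy usage: a 5-token response with noisy attribution estimates and reward 2.0.
phi = [0.1, 0.6, -0.2, 0.9, 0.6]
r_dense = shape_token_rewards(phi, sequence_reward=2.0, alpha=0.7)
print(r_dense, r_dense.sum())  # the sum remains 2.0
```

An outer loop (Bayesian optimization in the paper; omitted here) would then search over shaping parameters such as alpha by training the policy with the shaped rewards and scoring it on a downstream objective.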