Learning Explainable Dense Reward Shapes via Bayesian Optimization

Abstract

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign a single scalar reward to each sequence, using the final token as a surrogate indicator for the quality of the entire sequence. This leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise in the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and reaches an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature additive attribution functions preserve the same optimal policy as the original reward.
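As a rough illustration of the token-level shaping idea described in the abstract, the sketch below mixes a dense, attribution-based reward with the usual sparse terminal reward. It is not the authors' implementation: the function name `shape_rewards`, the single mixing parameter `weight`, and the attribution scores are all hypothetical, and the scores are made-up numbers rather than real SHAP or LIME output. In the setting the abstract describes, a parameter like `weight` would be what the outer Bayesian Optimization loop tunes.

```python
from typing import List


def shape_rewards(
    attributions: List[float],
    sequence_reward: float,
    weight: float,
) -> List[float]:
    """Return one reward per token (hypothetical helper, for illustration).

    attributions    -- per-token scores from some explainability method
    sequence_reward -- scalar reward the reward model gave the sequence
    weight          -- 0.0 keeps the sparse reward, 1.0 uses only the
                       dense attribution-based reward
    """
    if not attributions:
        raise ValueError("need at least one token")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    n = len(attributions)

    # Dense term: use the attributions as per-token rewards and put any
    # leftover on the final token, so the dense rewards sum exactly to
    # the sequence reward (an additive decomposition).
    dense = list(attributions)
    dense[-1] += sequence_reward - sum(attributions)

    # Sparse term: the whole reward sits on the final token.
    sparse = [0.0] * (n - 1) + [sequence_reward]

    # Convex combination of the two reward shapes.
    return [weight * d + (1.0 - weight) * s for d, s in zip(dense, sparse)]


# Toy attribution scores for a three-token response (invented values).
shaped = shape_rewards([0.5, 1.0, 0.25], sequence_reward=2.0, weight=0.5)
```

Because both terms sum to the sequence reward, any `weight` in [0, 1] redistributes credit across tokens without changing the total return of the sequence.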
