Reward-Robust RLHF in LLMs
September 18, 2024
Authors: Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen
cs.AI
Abstract
As Large Language Models (LLMs) continue to progress toward more advanced
forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is
increasingly seen as a key pathway toward achieving Artificial General
Intelligence (AGI). However, the reliance on reward-model-based (RM-based)
alignment methods introduces significant challenges due to the inherent
instability and imperfections of Reward Models (RMs), which can lead to
critical issues such as reward hacking and misalignment with human intentions.
In this paper, we introduce a reward-robust RLHF framework aimed at addressing
these fundamental challenges, paving the way for more reliable and resilient
learning in LLMs. Our approach introduces a novel optimization objective that
carefully balances performance and robustness by incorporating Bayesian Reward
Model Ensembles (BRME) to model the uncertainty set of reward functions. This
allows the framework to integrate both nominal performance and minimum reward
signals, ensuring more stable learning even with imperfect reward models.
Empirical results demonstrate that our framework consistently outperforms
traditional RLHF across diverse benchmarks, showing improved accuracy and
long-term stability. We also provide a theoretical analysis, demonstrating that
reward-robust RLHF approaches the stability of constant reward settings, which
proves to be effective in a stochastic-case analysis. Together, these
contributions highlight the framework's potential to enhance both the performance
and stability of LLM alignment with RLHF.
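
To make the idea concrete, below is a minimal sketch (not the authors' code) of the kind of reward-robust objective the abstract describes: an ensemble of reward models scores each response, and the training signal mixes the nominal (mean) reward with the worst-case (minimum) reward across the ensemble. The function name `robust_reward`, the tensor layout, and the trade-off weight `lam` are illustrative assumptions, not identifiers or values from the paper.

```python
import torch

def robust_reward(ensemble_rewards: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Combine per-head rewards into a single robust training signal.

    ensemble_rewards: tensor of shape (num_heads, batch), one reward estimate
        per ensemble head for each prompt-response pair.
    lam: weight on nominal performance; (1 - lam) weights the pessimistic
        (minimum) reward, trading raw performance for robustness.
    """
    nominal = ensemble_rewards.mean(dim=0)           # average reward across heads
    worst_case = ensemble_rewards.min(dim=0).values  # most pessimistic head per sample
    return lam * nominal + (1.0 - lam) * worst_case

# Example: 4 ensemble heads scoring a batch of 3 responses.
scores = torch.tensor([[0.9, 0.2, 0.7],
                       [0.8, 0.1, 0.6],
                       [0.7, 0.3, 0.9],
                       [0.6, 0.2, 0.8]])
print(robust_reward(scores, lam=0.5))  # combined signal fed to the policy update
```

Setting `lam = 1.0` recovers a purely nominal (ensemble-mean) objective, while `lam = 0.0` optimizes only the worst-case head; intermediate values correspond to the performance-robustness balance the framework aims for.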