Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
May 29, 2024
Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well established in standard reinforcement learning (RL), a practically implementable and theoretically grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations.

In this paper, we introduce value-incentivized preference optimization (VPO), a unified approach to online and offline RLHF, which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign indicating whether optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both the online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
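To make the sign-modulated regularization concrete, the following is a minimal schematic of the reward-learning objective described above, written in assumed notation (a Bradley-Terry preference likelihood over a dataset D of prompts x with preferred/dispreferred responses y+ and y-, a regularization weight alpha, a KL weight beta, and a reference policy pi_ref); it is a sketch of the idea, not the paper's exact formulation:

```latex
\hat{r} \;\in\; \arg\min_{r}\;
\underbrace{-\sum_{(x,\,y^{+},\,y^{-}) \in \mathcal{D}}
  \log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)}_{\text{negative log-likelihood of the preference data}}
\;\mp\; \alpha\, \mathbb{E}_{x}\Big[
  \underbrace{\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
    \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)}_{\text{corresponding KL-regularized value function}}
\Big]
```

Under one sign convention, subtracting the value term favors reward functions whose induced optimal policy attains a high value (optimism, online setting), while adding it penalizes them (pessimism, offline setting). Reusing the standard DPO-style implicit reward, r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) up to a prompt-dependent constant, then turns the reward objective into a direct policy objective, consistent with the simpler, reward-model-free pipeline the abstract describes.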