Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
May 29, 2024
Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well established in standard reinforcement learning (RL), a practically implementable and theoretically grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations.

In this paper, we introduce value-incentivized preference optimization (VPO), a unified approach to online and offline RLHF, which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign indicating whether optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both the online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
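To make the sign-modulated regularization concrete, the following is a minimal schematic of the reward-learning objective described above, written in assumed notation (a Bradley-Terry preference likelihood over a dataset D of prompts x with preferred/dispreferred responses y+ and y-, a regularization weight alpha, a KL weight beta, and a reference policy pi_ref); it is a sketch of the idea, not the paper's exact formulation:

```latex
\hat{r} \;\in\; \arg\min_{r}\;
\underbrace{-\sum_{(x,\,y^{+},\,y^{-}) \in \mathcal{D}}
  \log \sigma\big(r(x, y^{+}) - r(x, y^{-})\big)}_{\text{negative log-likelihood of the preference data}}
\;\mp\; \alpha\, \mathbb{E}_{x}\Big[
  \underbrace{\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big]
    \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)}_{\text{corresponding KL-regularized value function}}
\Big]
```

Under one sign convention, subtracting the value term favors reward functions whose induced optimal policy attains a high value (optimism, online setting), while adding it penalizes them (pessimism, offline setting). Reusing the standard DPO-style implicit reward, r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)) up to a prompt-dependent constant, then turns the reward objective into a direct policy objective, consistent with the simpler, reward-model-free pipeline the abstract describes.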