Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
May 29, 2024
Authors: Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has demonstrated great
promise in aligning large language models (LLMs) with human preference.
Depending on the availability of preference data, both online and offline RLHF
are active areas of investigation. A key bottleneck is understanding how to
incorporate uncertainty estimation in the reward function learned from the
preference data for RLHF, regardless of how the preference data is collected.
While the principles of optimism or pessimism under uncertainty are
well-established in standard reinforcement learning (RL), a
practically-implementable and theoretically-grounded form amenable to large
language models is not yet available, as standard techniques for constructing
confidence intervals become intractable under arbitrary policy
parameterizations.
In this paper, we introduce a unified approach to online and offline RLHF --
value-incentivized preference optimization (VPO) -- which regularizes the
maximum-likelihood estimate of the reward function with the corresponding value
function, modulated by a sign to indicate whether optimism or
pessimism is chosen. VPO also directly optimizes the policy with implicit
reward modeling, and therefore shares a simpler RLHF pipeline similar to direct
preference optimization. Theoretical guarantees of VPO are provided for both
online and offline settings, matching the rates of their standard RL
counterparts. Moreover, experiments on text summarization and dialog verify the
practicality and effectiveness of VPO.
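To make the idea in the abstract concrete, below is a minimal, illustrative PyTorch sketch of a VPO-style objective: a DPO-style preference log-likelihood built from the implicit reward, plus a sign-modulated regularizer standing in for the value term. The function name, the hyperparameters `alpha` and `beta`, the sign convention, and the use of the chosen-response implicit reward as a value proxy are assumptions made here for illustration, not the paper's exact formulation.

```python
# Illustrative sketch only (not the authors' exact objective): a DPO-style
# preference loss with a sign-modulated "value" regularizer, in the spirit of
# the VPO description above. `alpha`, `beta`, the sign convention, and the
# chosen-response value proxy are assumptions.
import torch
import torch.nn.functional as F

def vpo_style_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    beta: float = 0.1,                    # implicit-reward scale (KL strength)
    alpha: float = 0.01,                  # weight of the value-style regularizer
    sign: float = -1.0,                   # -1.0: pessimism (offline), +1.0: optimism (online)
) -> torch.Tensor:
    # Implicit rewards, as in DPO: r_theta(x, y) = beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry negative log-likelihood of the observed preferences
    preference_nll = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    # Value-style regularizer: a crude proxy using the implicit reward of the
    # chosen responses; the sign selects optimism (+) or pessimism (-).
    value_proxy = chosen_rewards.mean()

    # Minimizing this loss with sign = +1 encourages a higher value proxy
    # (optimism); with sign = -1 it penalizes it (pessimism).
    return preference_nll - sign * alpha * value_proxy
```

In this sketch, setting `alpha = 0` recovers a standard DPO-style loss, while `sign = -1.0` gives a pessimistic (offline-flavored) variant and `sign = +1.0` an optimistic (online-flavored) one, mirroring the sign modulation described in the abstract.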