Dataset Reset Policy Optimization for RLHF
April 12, 2024
Authors: Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
cs.AI
Abstract
Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) under the GPT-4 win-rate metric. Code for this work can be found at https://github.com/Cornell-RL/drpo.
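To make the dataset-reset idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): when collecting online rollouts, the sampler sometimes resets to a state taken from a labeler-preferred completion in the offline preference dataset (the prompt plus a random prefix of the chosen response) instead of always starting from the prompt alone. The dataset layout and the `policy.generate` / `reward_model.score` interfaces are assumptions for illustration only.

```python
import random

# Hypothetical sketch of the "dataset reset" idea in DR-PO:
# rollouts may start from states along preferred trajectories in the
# offline preference dataset, not only from the initial state (the prompt).

def collect_rollouts(policy, reward_model, offline_dataset, n_rollouts, reset_prob=0.5):
    """Collect rollouts, resetting to offline-dataset states with some probability.

    offline_dataset: list of dicts with keys "prompt" and "chosen" (the
        labeler-preferred completion), each a list of token ids (assumed format).
    policy.generate(prefix) -> completed token sequence (assumed interface).
    reward_model.score(sequence) -> scalar reward (assumed interface).
    """
    rollouts = []
    for _ in range(n_rollouts):
        example = random.choice(offline_dataset)
        prompt, chosen = example["prompt"], example["chosen"]

        if random.random() < reset_prob and len(chosen) > 0:
            # Dataset reset: start from the prompt plus a random prefix of
            # the preferred completion, i.e. a state covered by the dataset.
            cut = random.randrange(len(chosen) + 1)
            start_state = prompt + chosen[:cut]
        else:
            # Standard rollout: start from the initial state distribution.
            start_state = prompt

        sequence = policy.generate(start_state)   # finish the trajectory online
        reward = reward_model.score(sequence)     # score with the learned reward model
        rollouts.append((start_state, sequence, reward))
    return rollouts
```

Under this reading, the collected rollouts would then feed an otherwise standard online policy-optimization update (e.g., a PPO-style step), so the reset mechanism plugs the offline preference data directly into the online RLHF loop.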