Dataset Reset Policy Optimization for RLHF
April 12, 2024
Authors: Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
cs.AI
Abstract
Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) under the GPT-4 win-rate metric. Code for this work can be found at https://github.com/Cornell-RL/drpo.
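To make the dataset-reset idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation): when collecting online rollouts, the sampler sometimes resets to a state taken from a labeler-preferred completion in the offline preference dataset (the prompt plus a random prefix of the chosen response) instead of always starting from the prompt alone. The dataset layout and the `policy.generate` / `reward_model.score` interfaces are assumptions for illustration only.

```python
import random

# Hypothetical sketch of the "dataset reset" idea in DR-PO:
# rollouts may start from states along preferred trajectories in the
# offline preference dataset, not only from the initial state (the prompt).

def collect_rollouts(policy, reward_model, offline_dataset, n_rollouts, reset_prob=0.5):
    """Collect rollouts, resetting to offline-dataset states with some probability.

    offline_dataset: list of dicts with keys "prompt" and "chosen" (the
        labeler-preferred completion), each a list of token ids (assumed format).
    policy.generate(prefix) -> completed token sequence (assumed interface).
    reward_model.score(sequence) -> scalar reward (assumed interface).
    """
    rollouts = []
    for _ in range(n_rollouts):
        example = random.choice(offline_dataset)
        prompt, chosen = example["prompt"], example["chosen"]

        if random.random() < reset_prob and len(chosen) > 0:
            # Dataset reset: start from the prompt plus a random prefix of
            # the preferred completion, i.e. a state covered by the dataset.
            cut = random.randrange(len(chosen) + 1)
            start_state = prompt + chosen[:cut]
        else:
            # Standard rollout: start from the initial state distribution.
            start_state = prompt

        sequence = policy.generate(start_state)   # finish the trajectory online
        reward = reward_model.score(sequence)     # score with the learned reward model
        rollouts.append((start_state, sequence, reward))
    return rollouts
```

Under this reading, the collected rollouts would then feed an otherwise standard online policy-optimization update (e.g., a PPO-style step), so the reset mechanism plugs the offline preference data directly into the online RLHF loop.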