

Dataset Reset Policy Optimization for RLHF

April 12, 2024
Authors: Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
cs.AI

Abstract

Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful and Harmless (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) under the GPT-4 win-rate metric. Code for this work can be found at https://github.com/Cornell-RL/drpo.
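To make the dataset-reset idea in the abstract concrete, the sketch below mixes ordinary rollouts that start from the prompt with rollouts that restart from a state taken out of the offline preference data (here, the prompt plus a random prefix of a labeler-preferred completion), and scores the result with a learned reward model. This is a minimal illustrative sketch, not the authors' implementation: the `generate` and `score` callables, the prefix-based reset state, and the `reset_prob` mixing parameter are assumptions made for clarity; see the linked repository for the actual DR-PO code.

```python
import random


def dataset_reset_rollout(generate, score, prompt, preferred_completions,
                          reset_prob=0.5, rng=random):
    """One rollout with dataset reset (illustrative sketch of DR-PO's core idea).

    Instead of always generating from the initial state (the bare prompt),
    with probability `reset_prob` the rollout is reset to an intermediate
    state drawn from the offline preference data: the prompt plus a random
    prefix of a labeler-preferred completion. The policy then continues from
    that state, and the full response is scored by the learned reward model.
    `generate` and `score` are placeholder callables standing in for the
    policy and the reward model; they are not the authors' API.
    """
    if preferred_completions and rng.random() < reset_prob:
        # Dataset reset: jump to a state covered by the offline dataset.
        preferred = rng.choice(preferred_completions)
        cut = rng.randrange(len(preferred) + 1)  # random prefix length
        state = prompt + preferred[:cut]
    else:
        # Ordinary rollout from the initial state distribution (the prompt).
        state = prompt

    continuation = generate(state)               # policy continues from `state`
    response = state[len(prompt):] + continuation
    return response, score(prompt, response)


# Toy usage with stand-in policy and reward model:
if __name__ == "__main__":
    fake_policy = lambda s: " ...generated continuation"
    fake_reward = lambda p, r: float(len(r))     # dummy reward for illustration
    resp, reward = dataset_reset_rollout(
        fake_policy, fake_reward,
        prompt="Summarize: the cat sat on the mat. TL;DR:",
        preferred_completions=[" A cat sat on a mat."],
    )
    print(resp, reward)
```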

