Dataset Reset Policy Optimization for RLHF

Abstract

Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and it has produced impressive models such as GPT-4 and Claude 3 Opus. The framework typically consists of two steps: learning a reward model from an offline preference dataset, then running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation and with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization dataset and the Anthropic Helpful and Harmless (HH) dataset, generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) under the GPT-4 win-rate metric. Code for this work can be found at https://github.com/Cornell-RL/drpo.
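
The dataset-reset idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name sample_reset_state, the reset_prob mixing parameter, and the uniform choice of cut point are all assumptions made for illustration. The sketch only shows the one step the abstract names, which is starting a rollout from a state taken from the offline dataset (a prompt plus a prefix of a labeler-preferred response) rather than always from the initial state (the prompt alone).

```python
import random
from typing import List, Sequence, Tuple


def sample_reset_state(
    prompt: Sequence[str],
    preferred_response: Sequence[str],
    rng: random.Random,
    reset_prob: float = 1.0,  # hypothetical mixing parameter, not from the source
) -> Tuple[List[str], int]:
    """Pick the state from which the policy starts generating.

    Returns (start_state, prefix_len), where start_state is the prompt followed
    by the first prefix_len tokens of the labeler-preferred response.
    """
    # With probability 1 - reset_prob, fall back to the usual initial state.
    if rng.random() >= reset_prob:
        return list(prompt), 0

    # Otherwise reset to a state inside the offline trajectory. The cut point
    # is uniform here (an assumption); at least one token is left to generate.
    max_cut = max(len(preferred_response) - 1, 0)
    cut = rng.randint(0, max_cut)
    return list(prompt) + list(preferred_response[:cut]), cut


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = ["Summarize", ":", "post"]
    preferred = ["A", "short", "summary", "."]
    state, cut = sample_reset_state(prompt, preferred, rng)
    print(cut, state)
```

In a full training loop, the policy would continue generating from the returned state and the rollout would be scored by the learned reward model. Those parts are omitted because the abstract does not specify them.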
