PERL: Parameter Efficient Reinforcement Learning from Human Feedback

Abstract

Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method for aligning pretrained Large Language Models (LLMs) with human preferences. However, training models with RLHF is computationally expensive and complex overall. In this work, we study RLHF in which the underlying models are trained using Low-Rank Adaptation (LoRA), the parameter-efficient method introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which both reward model training and reinforcement learning are performed using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations on 7 reward modeling and reinforcement learning benchmarks, including 2 novel datasets. We find that PERL performs on par with the conventional RLHF setting while training faster and using less memory. This preserves the high performance of RLHF while reducing the computational burden that limits its adoption as an alignment technique for Large Language Models. To promote research on RLHF, we also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee" and "Taskmaster Ticketing".
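
To make the parameter-efficiency claim concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. It is not the paper's implementation. The class name `LoRALinear`, the helper functions, the layer sizes, and the random stand-in for the pretrained weight are all assumptions chosen for illustration. It follows the general LoRA recipe of Hu et al. [2021]: the pretrained weight W stays frozen, and only two small factors A and B are trained, with B initialized to zero so the adapted layer starts out identical to the frozen one.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen "pretrained" weight (random stand-in, assumption for illustration).
        self.W = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors: A is small random, B is zero,
        # so the low-rank update contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        base = matmul(x, transpose(self.W))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [[b + self.scale * l for b, l in zip(b_row, l_row)]
                for b_row, l_row in zip(base, low_rank)]

    def trainable_params(self):
        # Only A and B would receive gradient updates.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
print(layer.trainable_params(), layer.frozen_params())
```

With these illustrative sizes, the trainable factors hold 4·64 + 64·4 = 512 values against 64·64 = 4096 frozen ones. Fewer trainable parameters means less gradient and optimizer state to store, which is the kind of saving the abstract describes for both reward model training and reinforcement learning.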
