Efficient RLHF: Reducing the Memory Usage of PPO

Abstract

Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show that (1) using LoRA during PPO reduces its memory usage below that of SFT while improving alignment across four public benchmarks, and (2) Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
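
The abstract gives no implementation details, so the following is only a minimal sketch of what turning LoRA "off" can mean for a single linear layer. It assumes the layer keeps a frozen base weight plus a low-rank update, and that disabling the update recovers the output of the frozen base model without storing a second copy of the weights. All names (`LoRALinear`, `lora_disabled`, `matvec`) are hypothetical and are not taken from the paper.

```python
from contextlib import contextmanager


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight W plus a low-rank update B @ A (illustrative only)."""

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight      # frozen base weight, shape (out, in)
        self.lora_a = lora_a      # trainable, shape (rank, in)
        self.lora_b = lora_b      # trainable, shape (out, rank)
        self.scale = scale
        self.lora_enabled = True  # the "switch" this sketch is about

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.lora_enabled:
            # Low-rank path: B @ (A @ x), added on top of the base output.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


@contextmanager
def lora_disabled(layer):
    """Temporarily switch the low-rank update off, then restore it."""
    previous = layer.lora_enabled
    layer.lora_enabled = False
    try:
        yield layer
    finally:
        layer.lora_enabled = previous
```

Under these assumptions, calling `forward` inside `with lora_disabled(layer):` returns the base model's output, while calling it outside the block returns the adapted output, so both can be served from one set of base weights.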
