Efficient RLHF: Reducing the Memory Usage of PPO
September 1, 2023
Authors: Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
cs.AI
Abstract
Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible to use for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. Using LoRA during PPO reduces its memory usage to be smaller than SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
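To make the idea concrete, below is a minimal sketch (not the authors' implementation) of the two ingredients the abstract describes: a LoRA adapter that can be switched off at run time, and a single shared trunk serving both a policy (language-modeling) head and a scalar reward/value head, so no separate frozen copy of the model is needed. All names here (ToggleableLoRALinear, HydraModel, set_lora) and the rank/alpha defaults are hypothetical, chosen only for illustration.

```python
# Hedged sketch of "Hydra"-style weight sharing with toggleable LoRA.
# Assumptions: class/parameter names and dimensions are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ToggleableLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank adapter that can be disabled on the fly."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)   # base weights stay frozen
        self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
        self.lora_enabled = True                  # flipped off to recover the base model

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:
            out = out + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return out

class HydraModel(nn.Module):
    """One shared trunk feeding both a policy (LM) head and a reward/value head."""
    def __init__(self, hidden=768, vocab=32000):
        super().__init__()
        self.trunk = nn.Sequential(ToggleableLoRALinear(hidden, hidden), nn.GELU())
        self.lm_head = nn.Linear(hidden, vocab)   # policy logits
        self.value_head = nn.Linear(hidden, 1)    # scalar reward / value estimate

    def set_lora(self, enabled: bool):
        for m in self.modules():
            if isinstance(m, ToggleableLoRALinear):
                m.lora_enabled = enabled

    def forward(self, hidden_states):
        h = self.trunk(hidden_states)
        return self.lm_head(h), self.value_head(h)

# During the RL stage, the same weights can play two roles:
#   model.set_lora(True)   -> currently trained policy / value (LoRA active)
#   model.set_lora(False)  -> frozen base behavior, standing in for the reference model
```

In this sketch, only the LoRA parameters receive gradients, which is what keeps memory below SFT-level fine-tuning, and toggling the adapters off reuses the same weights in place of a separately stored frozen model; how exactly the shared heads are trained and queried is detailed in the paper itself.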