Efficient RLHF: Reducing the Memory Usage of PPO
September 1, 2023
Authors: Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
cs.AI
Abstract
Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible to use for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show: 1. Using LoRA during PPO reduces its memory usage to be smaller than SFT while improving alignment across four public benchmarks, and 2. Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
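To make the "turning LoRA off" idea concrete, below is a minimal PyTorch sketch (not the paper's implementation; the class and function names are illustrative) of a frozen linear layer whose low-rank adapter can be toggled at runtime: with the adapter on, the forward pass behaves as the trained model, and with it off the same weights act as the frozen base, which is how a single backbone can stand in for several of the model copies PPO normally keeps in memory.

```python
# A minimal sketch (not the authors' code) of dynamically toggling LoRA:
# one shared set of frozen base weights serves both as the actively trained
# model (adapter on) and as the frozen reference (adapter off), so no second
# copy of the weights needs to be held in memory.
import torch
import torch.nn as nn


class ToggleableLoRALinear(nn.Module):
    """A frozen linear layer with a low-rank adapter that can be switched off."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.base.bias.requires_grad_(False)
        # Low-rank update: only A and B receive gradients.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank
        self.adapter_enabled = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)
        if self.adapter_enabled:
            out = out + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return out


def set_adapter(model: nn.Module, enabled: bool) -> None:
    """Flip all adapters on (trained-model forward pass) or off (frozen-base forward pass)."""
    for module in model.modules():
        if isinstance(module, ToggleableLoRALinear):
            module.adapter_enabled = enabled


if __name__ == "__main__":
    layer = ToggleableLoRALinear(16, 16)
    x = torch.randn(2, 16)
    set_adapter(layer, True)   # adapter on: trained behavior
    tuned_out = layer(x)
    set_adapter(layer, False)  # adapter off: frozen base, same weights in memory
    base_out = layer(x)
    # Equal here only because lora_B starts at zero; they diverge once A and B are trained.
    print(torch.allclose(tuned_out, base_out))
```

The design choice this illustrates is purely about memory: instead of keeping separate policy, reference, and reward/value networks resident for PPO, a single set of base weights is reused and the adapter toggle selects which role the forward pass plays.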