Efficient RLHF: Reducing the Memory Usage of PPO

Abstract

Reinforcement Learning with Human Feedback (RLHF) has revolutionized language modeling by aligning models with human preferences. However, the RL stage, Proximal Policy Optimization (PPO), requires over 3x the memory of Supervised Fine-Tuning (SFT), making it infeasible for most practitioners. To address this issue, we present a comprehensive analysis of the memory usage, performance, and training time of memory-saving techniques for PPO. We introduce Hydra-RLHF by first integrating the SFT and Reward models and then dynamically turning LoRA "off" during training. Our experiments show that (1) using LoRA during PPO reduces its memory usage below that of SFT while improving alignment across four public benchmarks, and (2) Hydra-PPO reduces the latency per sample of LoRA-PPO by up to 65% while maintaining its performance. Our results demonstrate that Hydra-PPO is a simple and promising solution for enabling more widespread usage of RLHF.
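
To make the idea of turning LoRA "off" concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of the abstract: a single set of frozen base weights carries a low-rank adapter, and disabling the adapter recovers the frozen model's output without keeping a second copy of the weights. The class `LoRALinear`, the helper `matvec`, the `lora_off` context manager, and all numeric values are hypothetical names and data chosen for illustration.

```python
from contextlib import contextmanager


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical layer: frozen base weight W plus a switchable low-rank update B @ A."""

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight      # frozen base weight, shape (out, in)
        self.lora_a = lora_a      # trainable down-projection, shape (rank, in)
        self.lora_b = lora_b      # trainable up-projection, shape (out, rank)
        self.scale = scale
        self.lora_enabled = True

    def forward(self, x):
        base = matvec(self.weight, x)
        if not self.lora_enabled:
            return base           # adapter off: output of the frozen base weights only
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    @contextmanager
    def lora_off(self):
        """Temporarily disable the adapter and restore the previous state on exit."""
        previous = self.lora_enabled
        self.lora_enabled = False
        try:
            yield self
        finally:
            self.lora_enabled = previous


# Toy values, chosen only so the arithmetic is easy to follow.
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
)
x = [2.0, 4.0]
adapted_out = layer.forward(x)        # adapter on
with layer.lora_off():
    frozen_out = layer.forward(x)     # adapter off, same weights in memory
```

In this sketch both outputs come from the same stored `weight`, which is the kind of sharing that could reduce memory relative to holding separate model copies. Whether Hydra-RLHF toggles adapters in exactly this way is an assumption here.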
