RLHF Workflow: From Reward Modeling to Online RLHF

Abstract

This technical report presents the workflow of online iterative Reinforcement Learning from Human Feedback (RLHF), which the recent large language model (LLM) literature widely reports to outperform its offline counterpart by a large margin. Existing open-source RLHF projects, however, are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easily reproducible recipe for online iterative RLHF. Because online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as on academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can reach state-of-the-art performance using fully open-source datasets. We also make our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. For more detailed information, see https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF.
