RLHF Workflow: From Reward Modeling to Online RLHF
May 13, 2024
Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
cs.AI
Abstract
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
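
The abstract describes a loop in which the current policy generates responses, a proxy preference model trained on open-source preference data labels them in place of human annotators, and the policy is updated (the released model uses iterative DPO) on the freshly collected pairs before the next round of sampling. The sketch below illustrates that loop only at a schematic level: every function here (generate_candidates, score_with_proxy_rm, dpo_update) is a hypothetical placeholder, not the authors' implementation, which is available in the linked repositories.

```python
# Minimal sketch of online iterative RLHF with a proxy preference model.
# All helper functions are hypothetical stand-ins for illustration only.

import random


def generate_candidates(policy, prompt, n=4):
    """Placeholder: sample n responses from the current policy for a prompt."""
    return [f"{policy}-response-{i}-to-{prompt}" for i in range(n)]


def score_with_proxy_rm(prompt, response):
    """Placeholder proxy preference/reward model approximating human feedback."""
    return random.random()


def dpo_update(policy, preference_pairs):
    """Placeholder: one DPO training pass on the newly collected pairs."""
    return f"{policy}+dpo"


def iterative_rlhf(policy, prompts, num_iterations=3):
    for _ in range(num_iterations):
        preference_pairs = []
        for prompt in prompts:
            # 1. On-policy sampling: the *current* policy generates candidates,
            #    which is what makes the procedure online/iterative.
            candidates = generate_candidates(policy, prompt)
            # 2. The proxy preference model ranks the candidates; taking the
            #    best and worst yields one new preference pair per prompt.
            ranked = sorted(candidates, key=lambda y: score_with_proxy_rm(prompt, y))
            chosen, rejected = ranked[-1], ranked[0]
            preference_pairs.append((prompt, chosen, rejected))
        # 3. Update the policy on the freshly labeled pairs, then repeat.
        policy = dpo_update(policy, preference_pairs)
    return policy


if __name__ == "__main__":
    print(iterative_rlhf("llama-3-8b-sft", ["What is RLHF?", "Explain DPO."]))
```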