

RLHF Workflow: From Reward Modeling to Online RLHF

May 13, 2024
作者: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
cs.AI

Abstract

In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported in the recent large language model (LLM) literature to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects remain largely confined to the offline learning setting. We aim to fill this gap by providing a detailed, easily reproducible recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as on academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can reach state-of-the-art performance using fully open-source datasets. Furthermore, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more details.
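The workflow the abstract outlines — train a proxy preference (reward) model on open-source preference data, then alternate between sampling responses from the current policy, labeling them with the proxy model, and updating the policy with DPO — can be summarized in a short sketch. The code below is a minimal, hypothetical illustration, not the authors' released implementation from RLHFlow/Online-RLHF; the helper interfaces generate_fn, proxy_score_fn, policy_logp_fn, and ref_logp_fn are assumptions standing in for a real policy, frozen reference model, and proxy preference model.

```python
# Hypothetical sketch of one round of online iterative DPO as described above:
# sample from the current policy, label with a proxy preference model, update.
# NOT the authors' released code; the *_fn callables are assumed interfaces.

import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard Direct Preference Optimization loss on sequence log-probs."""
    # Implicit reward margin: how much more the policy (relative to the frozen
    # reference) prefers the chosen response over the rejected one.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Maximize log-sigmoid of the margin, i.e. minimize its negation.
    return -F.logsigmoid(margin).mean()


def online_dpo_iteration(prompts, generate_fn, proxy_score_fn,
                         policy_logp_fn, ref_logp_fn, optimizer,
                         num_samples=8, beta=0.1):
    """One online iteration: collect fresh preference pairs from the current
    policy, using the proxy preference model in place of human labels, then
    take DPO gradient steps on the new pairs."""
    for prompt in prompts:
        # 1. Sample several candidate responses from the current policy.
        candidates = [generate_fn(prompt) for _ in range(num_samples)]
        # 2. Rank candidates with the proxy preference model; form a
        #    best-vs-worst (chosen, rejected) pair.
        scores = [proxy_score_fn(prompt, c) for c in candidates]
        chosen = candidates[max(range(num_samples), key=scores.__getitem__)]
        rejected = candidates[min(range(num_samples), key=scores.__getitem__)]
        # 3. DPO update on the newly collected pair.
        loss = dpo_loss(policy_logp_fn(prompt, chosen),
                        policy_logp_fn(prompt, rejected),
                        ref_logp_fn(prompt, chosen),
                        ref_logp_fn(prompt, rejected),
                        beta=beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A full recipe would typically collect a large batch of pairs per iteration before training and may refresh the reference model between iterations; the per-prompt update here is only to keep the sketch compact.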

