RLHF Workflow: From Reward Modeling to Online RLHF

Abstract

This technical report presents the workflow of online iterative Reinforcement Learning from Human Feedback (RLHF), which the recent large language model (LLM) literature widely reports to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as on academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Our models, curated datasets, and comprehensive step-by-step code guidebooks are publicly available; see https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for details.

