Secrets of RLHF in Large Language Models Part I: PPO
July 11, 2023
Authors: Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang
cs.AI
Abstract
Large language models (LLMs) have formulated a blueprint for the advancement
of artificial general intelligence. Their primary objective is to function as a
human-centric (helpful, honest, and harmless) assistant. Alignment with humans
assumes paramount significance, and reinforcement learning with human feedback
(RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
Current technical routes usually include reward models to measure
human preferences, Proximal Policy Optimization (PPO) to optimize
policy model outputs, and process supervision to improve step-by-step
reasoning capabilities. However, the challenges of reward design, environment
interaction, and agent training, coupled with the huge trial-and-error cost of
large language models, pose a significant barrier for AI researchers seeking to
advance technical alignment and the safe deployment of LLMs. The stable
training of RLHF remains a puzzle. In this first
report, we dissect the framework of RLHF, re-evaluate the inner workings of
PPO, and explore how the parts comprising PPO algorithms impact policy agent
training. We identify policy constraints as the key factor in the effective
implementation of the PPO algorithm. Therefore, we explore PPO-max, an
advanced version of the PPO algorithm, to efficiently improve the training
stability of the policy model. Based on our main results, we perform a
comprehensive analysis of RLHF abilities, comparing them with SFT models and ChatGPT.
The absence of open-source implementations has posed significant challenges to
the investigation of LLM alignment. Therefore, we are eager to release our
technical reports, reward models, and PPO code.
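
To make the "policy constraint" idea in the abstract concrete, the following is a minimal, illustrative sketch, not the authors' released PPO-max implementation: a PPO clipped surrogate loss with a per-token KL-style penalty toward a frozen reference (SFT) policy. The function name, tensor shapes, and hyperparameter values (clip_eps, kl_coef) are assumptions for illustration only.

```python
# Illustrative sketch of a policy-constrained PPO loss (assumed shapes/values).
import torch


def ppo_policy_loss(
    logprobs: torch.Tensor,      # log pi_theta(a_t | s_t), shape (batch, seq_len)
    old_logprobs: torch.Tensor,  # log-probs from the policy that sampled the data
    ref_logprobs: torch.Tensor,  # log-probs from a frozen reference (SFT) model
    advantages: torch.Tensor,    # advantage estimates, e.g. from GAE
    mask: torch.Tensor,          # 1 for response tokens, 0 for prompt/padding
    clip_eps: float = 0.2,       # PPO clipping range (illustrative value)
    kl_coef: float = 0.05,       # weight of the KL policy constraint (illustrative)
) -> torch.Tensor:
    """Clipped PPO surrogate loss plus a penalty that constrains the policy."""
    # Probability ratio between the current policy and the behavior (old) policy.
    ratio = torch.exp(logprobs - old_logprobs)

    # Standard PPO clipped surrogate objective (maximized, hence negated below).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Per-token penalty that keeps the policy close to the reference model.
    kl_penalty = kl_coef * (logprobs - ref_logprobs)

    # Average over response tokens only.
    loss = -((surrogate - kl_penalty) * mask).sum() / mask.sum().clamp(min=1.0)
    return loss


if __name__ == "__main__":
    # Smoke test with random tensors standing in for model outputs.
    batch, seq_len = 2, 8
    lp = torch.randn(batch, seq_len)
    loss = ppo_policy_loss(
        logprobs=lp,
        old_logprobs=lp.detach() - 0.05,
        ref_logprobs=lp.detach() - 0.10,
        advantages=torch.randn(batch, seq_len),
        mask=torch.ones(batch, seq_len),
    )
    print(float(loss))
```

Note that many RLHF implementations add the KL term to the reward signal rather than folding it into the loss as done here; the sketch takes the latter route only to keep the example short.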