Secrets of RLHF in Large Language Models Part I: PPO

July 11, 2023
Authors: Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang
cs.AI

Abstract

Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as human-centric (helpful, honest, and harmless) assistants. Alignment with humans is of paramount importance, and reinforcement learning with human feedback (RLHF) has emerged as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose a significant barrier for AI researchers seeking to advance technical alignment and the safe deployment of LLMs. The stable training of RLHF remains a puzzle. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the components of the PPO algorithm impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. We therefore explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we present a comprehensive analysis of RLHF capabilities in comparison with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release our technical report, reward models, and PPO code.
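The abstract's central finding is that a policy constraint is the decisive ingredient for stable PPO training in RLHF. As a rough, hypothetical illustration of that idea (not the paper's actual PPO-max implementation), the sketch below combines the standard clipped PPO surrogate loss with a per-token KL penalty that discourages the policy from drifting away from the SFT reference model; the function name, tensor shapes, and the `clip_eps` and `kl_coef` values are illustrative assumptions.

```python
# Minimal sketch of a PPO policy loss with a KL-based policy constraint.
# Illustrative only -- names and hyperparameters are assumptions, not the
# paper's PPO-max implementation.
import torch

def ppo_policy_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                    clip_eps=0.2, kl_coef=0.05):
    """Clipped PPO surrogate plus a KL penalty toward a reference (SFT) policy.

    All tensors are shape (batch, seq_len): per-token log-probabilities of the
    sampled tokens under the current, old, and reference policies, and the
    estimated per-token advantages.
    """
    # Probability ratio pi_theta / pi_old, computed in log space for stability.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the surrogate, so we minimize its negative.
    surrogate = -torch.min(unclipped, clipped).mean()

    # Approximate per-token KL(pi_theta || pi_ref): penalizes drift from the SFT model.
    kl_penalty = (logprobs - ref_logprobs).mean()

    return surrogate + kl_coef * kl_penalty
```

In a sketch like this, shrinking `clip_eps` or raising `kl_coef` tightens the policy constraint and keeps the optimized policy closer to the reference model, which is the general mechanism the report credits for training stability.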