Secrets of RLHF in Large Language Models Part I: PPO

July 11, 2023
Authors: Rui Zheng, Shihan Dou, Songyang Gao, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Limao Xiong, Lu Chen, Zhiheng Xi, Yuhao Zhou, Nuo Xu, Wenbin Lai, Minghao Zhu, Rongxiang Weng, Wensen Cheng, Cheng Chang, Zhangyue Yin, Yuan Hua, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang
cs.AI

Abstract

Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, the challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose significant barriers for AI researchers seeking to advance technical alignment and the safe deployment of LLMs. Stable RLHF training remains a puzzle. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising the PPO algorithm impact policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, which effectively improves the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities in comparison with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release our technical report, reward models, and PPO code.
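The policy constraint the abstract highlights is commonly realized as a clipped surrogate objective combined with a KL-style penalty that keeps the policy close to a frozen reference (SFT) model. The sketch below is a minimal, generic PyTorch illustration of that idea; the tensor names, the `clip_eps` and `kl_coef` values, and the simple log-ratio KL approximation are assumptions for illustration, not the authors' PPO-max recipe (see their released code for the actual implementation).

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, ref_logprobs, advantages,
                  clip_eps=0.2, kl_coef=0.1):
    """Clipped PPO surrogate with a KL-style penalty toward a frozen
    reference (SFT) policy -- a generic sketch of the 'policy constraint'
    idea, not the paper's PPO-max implementation."""
    # Probability ratio between the current policy and the rollout policy.
    ratio = torch.exp(logprobs - old_logprobs)
    # PPO-clip surrogate: take the pessimistic (minimum) of the two terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Simple KL approximation penalizing drift from the reference model.
    kl_penalty = kl_coef * (logprobs - ref_logprobs).mean()
    return policy_loss + kl_penalty
```

In practice the KL term is often folded into the per-token reward rather than added to the loss; either placement serves the same constraining role of preventing the policy from drifting too far from the reference model.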