Secrets of RLHF in Large Language Models Part I: PPO

Abstract

Large language models (LLMs) have laid out a blueprint for the advancement of artificial general intelligence. Their primary objective is to function as human-centric (helpful, honest, and harmless) assistants. Alignment with humans is of paramount importance, and reinforcement learning with human feedback (RLHF) has emerged as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include reward models to measure human preferences, Proximal Policy Optimization (PPO) to optimize policy model outputs, and process supervision to improve step-by-step reasoning capabilities. However, the challenges of reward design, environment interaction, and agent training, coupled with the high trial-and-error cost of large language models, pose a significant barrier for AI researchers working to advance technical alignment and the safe landing of LLMs. Stable RLHF training remains an open problem. In this first report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising the PPO algorithm affect policy agent training. We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. Therefore, we explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model. Based on our main results, we perform a comprehensive analysis of RLHF abilities compared with SFT models and ChatGPT. The absence of open-source implementations has posed significant challenges to the investigation of LLM alignment. Therefore, we are eager to release technical reports, reward models, and PPO code.
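To make the notion of a "policy constraint" concrete, the following is a minimal sketch of two constraints commonly used in PPO-based RLHF: the clipped probability-ratio objective of standard PPO, and a per-token KL-style penalty that keeps the policy close to a reference (for example SFT) model. It is a generic illustration, not the PPO-max recipe from this report. The function names, the `beta` coefficient, and the default `clip_eps=0.2` are assumptions chosen for the example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to maximize).

    Inputs are per-token log-probabilities under the new and old policy
    and the matching advantage estimates (hypothetical toy inputs).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        # Constrain the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the pessimistic (smaller) of the two candidate terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def kl_penalized_rewards(reward_score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards with a penalty for drifting from the reference.

    Each token gets -beta * (logp_policy - logp_ref); the scalar reward
    model score is added on the final token only (an assumed convention).
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += reward_score
    return rewards
```

With a positive advantage, the clipped term caps how much a single update can gain from raising a token's probability; with a negative advantage, the unclipped term is kept, so harmful moves are still fully penalized.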
