Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Abstract

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (lambda=1, gamma=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks while demonstrating remarkable efficiency, requiring only a tenth of the training steps of the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
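To make the GAE (lambda=1, gamma=1) setting concrete, the sketch below computes Generalized Advantage Estimation for a single finished episode. It is an illustration only, not the released Open-Reasoner-Zero code: the function name `gae_advantages`, the sample numbers, and the assumption that a rule-based reward arrives only on the final token are all hypothetical. With both parameters set to 1, the estimator reduces to the undiscounted sum of future rewards minus the critic's value estimate at each step.

```python
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over one finished episode.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate V(s_t). The value after the final step is taken to be 0.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # V(s_{T}) = 0 once the episode has ended
    running = 0.0     # discounted sum of TD residuals from later steps
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical 4-token response with a single terminal reward of 1.0
# (assumed here to stand in for a rule-based correctness check).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.4, 0.5, 0.9]  # made-up critic outputs
adv = gae_advantages(rewards, values)
```

In this example every advantage equals `1.0 - values[t]` up to floating-point error, because with gamma=1 and lambda=1 the TD residuals telescope into the full return minus the value baseline.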
