
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

March 31, 2025
Authors: Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum
cs.AI

Abstract

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE (lambda=1, gamma=1) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both response length and benchmark performance, similar to the phenomenon observed in DeepSeek-R1-Zero. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency, requiring only a tenth of the training steps of the DeepSeek-R1-Zero pipeline. In the spirit of open source, we release our source code, parameter settings, training data, and model weights across various sizes.
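To make the abstract's GAE setting concrete, the sketch below (not the authors' code; the function name and toy numbers are illustrative) shows what GAE with lambda=1 and gamma=1 computes: the advantage at each step collapses to the full undiscounted trajectory return minus the critic's value baseline, which pairs naturally with a sparse rule-based reward given only at the end of a response.

```python
# Minimal sketch, assuming a single finished trajectory with a terminal
# rule-based reward (1.0 for a correct final answer, 0.0 otherwise).
from typing import List


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward observed after step t; values[t] is the critic's
    estimate V(s_t). With gamma = lam = 1 the recursion reduces to
    A_t = sum(rewards[t:]) - values[t], i.e. Monte Carlo return minus baseline.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # Value of the next state; 0.0 at the end of the trajectory.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


if __name__ == "__main__":
    rewards = [0.0, 0.0, 0.0, 1.0]   # sparse terminal reward
    values = [0.3, 0.4, 0.6, 0.8]    # toy critic estimates
    print(gae_advantages(rewards, values))  # -> [0.7, 0.6, 0.4, 0.2]
```

With these settings there is no discounting and no bias from bootstrapped intermediate values, which matches the paper's emphasis on keeping the RL recipe as simple as possible.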
