

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

March 24, 2025
Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
cs.AI

Abstract

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative, as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models outside the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
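
The abstract refers to a rule-based reward and to adjusting the format reward. The sketch below is a minimal, hypothetical illustration of what such a reward function can look like, not the authors' released implementation: it assumes final answers are marked with \boxed{...} and combines an exact-match correctness reward with a mild, relaxed format penalty (a relaxed format reward of this kind is one plausible way to avoid stalling weaker base models with a strict response template).

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward: correctness plus a relaxed format check.

    Illustrative sketch only; the actual reward functions are part of the
    authors' open-sourced code.
    """
    # Relaxed format check: only require that a final answer appears in
    # \boxed{...}, rather than enforcing a full response template.
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    format_reward = 0.0 if match else -0.5  # mild penalty when no boxed answer is found

    # Correctness check: exact string match against the reference answer.
    # Real pipelines typically use a math-equivalence checker instead.
    correct = match is not None and match.group(1).strip() == gold_answer.strip()
    correctness_reward = 1.0 if correct else 0.0

    return correctness_reward + format_reward


# Example usage
print(rule_based_reward("... so the result is \\boxed{42}.", "42"))  # 1.0
print(rule_based_reward("The answer is 42.", "42"))                  # -0.5
```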
