SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
March 24, 2025
Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
cs.AI
Abstract
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base model, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have focused primarily on the Qwen2.5 model series, which may not be representative, as we find these base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models spanning different families and sizes, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models outside the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
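For readers unfamiliar with the term, a "rule-based reward" here means a reward computed by simple string-matching rules rather than a learned reward model. The sketch below illustrates the general idea under common assumptions (a final answer wrapped in \boxed{} and an exact-match check); the function name, the \boxed{} convention, and the specific reward values are illustrative assumptions, not the paper's exact implementation.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Illustrative rule-based reward for zero RL training.

    Extracts the model's final answer (assumed to be wrapped in \\boxed{...})
    and compares it to the reference answer. The mild penalty for a missing
    or unparseable answer stands in for the kind of format-reward adjustment
    mentioned in the abstract; the exact values are assumptions.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        # No parseable answer: small penalty instead of a hard failure,
        # so weaker base models are not over-penalized early in training.
        return -0.5
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Example usage (hypothetical responses):
print(rule_based_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(rule_based_reward("I think the answer is 42", "42"))          # -0.5
```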