SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
March 24, 2025
Authors: Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, Junxian He
cs.AI
Abstract
DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base model, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have focused primarily on the Qwen2.5 model series, which may not be representative, as we find these base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models spanning different families and sizes, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models outside the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
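For readers unfamiliar with the term, a "rule-based reward" here means a reward computed by simple string-matching rules rather than a learned reward model. The sketch below illustrates the general idea under common assumptions (a final answer wrapped in \boxed{} and an exact-match check); the function name, the \boxed{} convention, and the specific reward values are illustrative assumptions, not the paper's exact implementation.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Illustrative rule-based reward for zero RL training.

    Extracts the model's final answer (assumed to be wrapped in \\boxed{...})
    and compares it to the reference answer. The mild penalty for a missing
    or unparseable answer stands in for the kind of format-reward adjustment
    mentioned in the abstract; the exact values are assumptions.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        # No parseable answer: small penalty instead of a hard failure,
        # so weaker base models are not over-penalized early in training.
        return -0.5
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# Example usage (hypothetical responses):
print(rule_based_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(rule_based_reward("I think the answer is 42", "42"))          # -0.5
```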