R-Zero: Self-Evolving Reasoning LLM from Zero Data
August 7, 2025
Authors: Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu
cs.AI
Abstract
Self-evolving Large Language Models (LLMs) offer a scalable path toward
super-intelligence by autonomously generating, refining, and learning from
their own experiences. However, existing methods for training such models still
rely heavily on large volumes of human-curated tasks and labels, typically via fine-tuning
or reinforcement learning, which poses a fundamental bottleneck to advancing AI
systems toward capabilities beyond human intelligence. To overcome this
limitation, we introduce R-Zero, a fully autonomous framework that generates
its own training data from scratch. Starting from a single base LLM, R-Zero
initializes two independent models with distinct roles: a Challenger and a
Solver. These models are optimized separately and co-evolve through
interaction: the Challenger is rewarded for proposing tasks near the edge of
the Solver's capability, and the Solver is rewarded for solving increasingly
challenging tasks posed by the Challenger. This process yields a targeted,
self-improving curriculum without any pre-existing tasks or labels.
Empirically, R-Zero substantially improves reasoning capability across
different backbone LLMs, e.g., boosting Qwen3-4B-Base by +6.49 on
math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
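The Challenger–Solver loop can be summarized in pseudocode. The sketch below is a minimal, illustrative reading of the abstract only: treating a roughly 50% empirical solve rate as the "edge" of the Solver's capability, using majority voting over the Solver's own samples as a pseudo-label, and all function names (`challenger_propose`, `solver_attempt`, `update_challenger`, `update_solver`) are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of a Challenger-Solver co-evolution loop.
# Reward shaping, pseudo-labeling, and all callables are assumptions,
# not R-Zero's actual training recipe.
from typing import Callable, List


def challenger_reward(solve_rate: float) -> float:
    """Highest reward when a task sits near the Solver's capability edge
    (assumed here to mean an empirical solve rate of about 50%)."""
    return 1.0 - 2.0 * abs(solve_rate - 0.5)


def co_evolve(
    challenger_propose: Callable[[], str],           # generates a task prompt
    solver_attempt: Callable[[str], str],            # samples one answer
    majority_label: Callable[[List[str]], str],      # pseudo-label via voting
    update_challenger: Callable[[str, float], None],
    update_solver: Callable[[str, str, float], None],
    num_rounds: int = 3,
    tasks_per_round: int = 8,
    samples_per_task: int = 8,
) -> None:
    """One possible loop: the Challenger is rewarded for tasks of intermediate
    difficulty; the Solver is rewarded for agreeing with the majority-vote
    pseudo-label. No human-curated tasks or labels enter the loop."""
    for _ in range(num_rounds):
        for _ in range(tasks_per_round):
            task = challenger_propose()
            answers = [solver_attempt(task) for _ in range(samples_per_task)]
            label = majority_label(answers)            # self-generated label
            solve_rate = answers.count(label) / len(answers)

            # Challenger update: push toward tasks the Solver solves ~half the time.
            update_challenger(task, challenger_reward(solve_rate))

            # Solver update: reward answers that match the pseudo-label.
            for ans in answers:
                update_solver(task, ans, 1.0 if ans == label else 0.0)
```

In this reading, no external supervision is needed: the Solver's own majority answer serves as the label, and the Challenger is steered toward tasks at the boundary of what the Solver can currently do, which is what produces the self-improving curriculum the abstract describes.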