Absolute Zero: Reinforced Self-play Reasoning with Zero Data
May 6, 2025
Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in
enhancing the reasoning capabilities of large language models by learning
directly from outcome-based rewards. Recent RLVR works that operate under the
zero setting avoid supervision in labeling the reasoning process, but still
depend on manually curated collections of questions and answers for training.
The scarcity of high-quality, human-produced examples raises concerns about the
long-term scalability of relying on human supervision, a challenge already
evident in the domain of language model pretraining. Furthermore, in a
hypothetical future where AI surpasses human intelligence, tasks provided by
humans may offer limited learning potential for a superintelligent system. To
address these concerns, we propose a new RLVR paradigm called Absolute Zero, in
which a single model learns to propose tasks that maximize its own learning
progress and improves reasoning by solving them, without relying on any
external data. Under this paradigm, we introduce the Absolute Zero Reasoner
(AZR), a system that self-evolves its training curriculum and reasoning ability
by using a code executor to both validate proposed code reasoning tasks and
verify answers, serving as a unified source of verifiable reward to guide
open-ended yet grounded learning. Despite being trained entirely without
external data, AZR achieves overall SOTA performance on coding and mathematical
reasoning tasks, outperforming existing zero-setting models that rely on tens
of thousands of in-domain human-curated examples. Furthermore, we demonstrate
that AZR can be effectively applied across different model scales and is
compatible with various model classes.
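To make the paradigm concrete, below is a minimal sketch, in Python, of the propose-solve-verify loop the abstract describes: the model proposes a code task, a code executor validates it and produces the ground-truth answer, and the same model is rewarded for solving it. The `model` interface (`propose_program`, `predict_output`) and the simplified reward scheme are hypothetical illustrations, not the authors' implementation, which uses multiple task modes and a more careful, difficulty-aware proposer reward.

```python
# Minimal sketch of an Absolute Zero style propose-solve-verify loop.
# Illustrative assumptions throughout, not the paper's code: the `model`
# interface, the single deduction-style task mode, and the simplified rewards.
import subprocess
import sys


def execute(program: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run a self-contained Python program in a subprocess.

    The executor plays the role the abstract describes: a single verifiable
    source of reward that both validates proposed tasks and checks answers.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.returncode == 0, result.stdout.strip()
    except subprocess.TimeoutExpired:
        return False, ""


def self_play_step(model) -> tuple[float, float]:
    """One self-play iteration; returns (proposer_reward, solver_reward)."""
    # Proposer role: the model writes a program whose printed output the
    # solver must later predict (hypothetical `propose_program` API).
    program = model.propose_program()

    # Task validation: a proposal earns signal only if it actually executes.
    valid, ground_truth = execute(program)
    if not valid:
        return 0.0, 0.0

    # Solver role: the same model predicts the output without running the
    # code (hypothetical `predict_output` API); the executor's real output
    # serves as the answer key.
    prediction = model.predict_output(program)
    solver_reward = 1.0 if prediction == ground_truth else 0.0

    # Crude curriculum signal: the proposer is rewarded when the solver
    # fails, pushing proposed tasks toward the edge of the solver's ability.
    proposer_reward = 1.0 - solver_reward
    return proposer_reward, solver_reward
```

Grounding both rewards in actual execution is what keeps the open-ended, self-proposed curriculum from drifting: every reward the proposer or solver receives traces back to behavior the executor verified.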