Absolute Zero: Reinforced Self-play Reasoning with Zero Data

May 6, 2025
Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
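
The abstract states that a code executor both validates proposed tasks and verifies answers. The following is a minimal illustrative sketch of that idea only, not the authors' implementation. The function names (run_program, validate_task, verify_answer), the convention that a task is a program defining f(x) plus an input, the determinism check, and the binary reward are all assumptions made for this example. Python's built-in exec is used for brevity and is not a sandbox.

```python
from typing import Any, Tuple


def run_program(source: str, arg: Any) -> Any:
    """Execute `source`, which is assumed to define f(x), and return f(arg).

    Note: exec() is NOT sandboxed; a real system would isolate execution.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f"](arg)


def validate_task(source: str, arg: Any) -> Tuple[bool, Any]:
    """Hypothetical validity check for a proposed task.

    The task counts as valid here if the program runs without raising and
    returns the same output on two runs (an assumed determinism criterion).
    Returns (is_valid, reference_output).
    """
    try:
        first = run_program(source, arg)
        second = run_program(source, arg)
    except Exception:
        return False, None
    if first != second:
        return False, None
    return True, first


def verify_answer(reference_output: Any, answer: Any) -> float:
    """Assumed binary verifiable reward: 1.0 on exact match, else 0.0."""
    return 1.0 if answer == reference_output else 0.0


# A proposed task: a small program and an input for it.
proposed_program = "def f(x):\n    return sorted(x, reverse=True)\n"
proposed_input = [3, 1, 2]

is_valid, reference = validate_task(proposed_program, proposed_input)
reward_correct = verify_answer(reference, [3, 2, 1])
reward_wrong = verify_answer(reference, [1, 2, 3])
```

In this sketch the same executor output serves both roles: it filters out tasks that crash or behave inconsistently, and it supplies the reference value against which a solver's answer is scored.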

