Absolute Zero: Reinforced Self-play Reasoning with Zero Data
May 6, 2025
Authors: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in
enhancing the reasoning capabilities of large language models by learning
directly from outcome-based rewards. Recent RLVR works that operate under the
zero setting avoid supervision in labeling the reasoning process, but still
depend on manually curated collections of questions and answers for training.
The scarcity of high-quality, human-produced examples raises concerns about the
long-term scalability of relying on human supervision, a challenge already
evident in the domain of language model pretraining. Furthermore, in a
hypothetical future where AI surpasses human intelligence, tasks provided by
humans may offer limited learning potential for a superintelligent system. To
address these concerns, we propose a new RLVR paradigm called Absolute Zero, in
which a single model learns to propose tasks that maximize its own learning
progress and improves reasoning by solving them, without relying on any
external data. Under this paradigm, we introduce the Absolute Zero Reasoner
(AZR), a system that self-evolves its training curriculum and reasoning ability
by using a code executor to both validate proposed code reasoning tasks and
verify answers, serving as a unified source of verifiable reward to guide
open-ended yet grounded learning. Despite being trained entirely without
external data, AZR achieves overall SOTA performance on coding and mathematical
reasoning tasks, outperforming existing zero-setting models that rely on tens
of thousands of in-domain human-curated examples. Furthermore, we demonstrate
that AZR can be effectively applied across different model scales and is
compatible with various model classes.
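For intuition, below is a minimal sketch in Python of the propose-validate-solve-verify loop the abstract describes. The names `propose_task` and `solve_task` are hypothetical stand-ins for the single model acting in its two roles; AZR's actual RL training and reward shaping (e.g., a learnability reward for the proposer) are not reproduced here.

```python
# Minimal sketch of an Absolute Zero-style self-play step, assuming
# Python programs as the task domain. The two model calls are mocked.

def run_program(src: str, arg):
    """Code executor: run function `f` defined in `src` on `arg`.

    Returns the output, or None if the proposed task is invalid
    (fails to parse, raises, etc.), so ill-formed tasks are filtered out.
    """
    env: dict = {}
    try:
        exec(src, env)        # define f in an isolated namespace
        return env["f"](arg)  # ground-truth output
    except Exception:
        return None

def propose_task():
    # Hypothetical proposer: the model would sample a (program, input)
    # pair expected to maximize its own learning progress.
    return "def f(x):\n    return x * 2 + 1", 3

def solve_task(src: str, arg):
    # Hypothetical solver: the same model would predict f(arg) by
    # reasoning about the program, without executing it.
    return 7

program, inp = propose_task()
target = run_program(program, inp)  # executor validates the proposed task
if target is not None:              # only well-formed tasks are kept
    prediction = solve_task(program, inp)
    reward = 1.0 if prediction == target else 0.0  # verifiable outcome reward
    print(f"target={target}, prediction={prediction}, reward={reward}")
```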