
Absolute Zero: Reinforced Self-play Reasoning with Zero Data

May 6, 2025
作者: Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
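
The propose-validate-solve-verify cycle the abstract describes can be pictured with the following sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the `model` interface (`propose_task`, `solve_task`, `update`), the `buffer` object, and the task attributes are hypothetical stand-ins, the executor is simplified to Python's built-in `exec`, and the proposer reward shown is only one plausible shaping.

```python
# Minimal sketch of an Absolute Zero-style self-play loop (hypothetical
# names, not the paper's code). One model both proposes code reasoning
# tasks and solves them; program execution is the only reward source.

def run_program(program: str, task_input) -> object:
    """Execute a proposed program on an input; acts as the verifier.
    In practice this would run inside a sandboxed code executor."""
    namespace: dict = {}
    exec(program, namespace)             # define the proposed program
    return namespace["f"](task_input)    # convention: task defines f(x)

def self_play_step(model, buffer):
    # 1. PROPOSE: the model generates a program + input, i.e. a new task.
    task = model.propose_task(examples=buffer.sample())  # hypothetical API

    # 2. VALIDATE: the executor checks the task is well-formed by running
    #    it; the ground-truth target comes from execution, not from humans.
    try:
        target = run_program(task.program, task.input)
    except Exception:
        return  # malformed proposal: no reward, task discarded

    # 3. SOLVE: the same model attempts the task (e.g. predict the output).
    prediction = model.solve_task(task)                  # hypothetical API

    # 4. VERIFY + REWARD: executor output grounds both roles' rewards.
    solver_reward = 1.0 if prediction == target else 0.0
    # One plausible proposer shaping: reward novel tasks the solver still
    # fails, so the curriculum tracks the solver's learning progress.
    proposer_reward = (1.0 - solver_reward) if buffer.is_novel(task) else 0.0

    model.update(task, solver_reward, proposer_reward)   # RL update
    buffer.add(task)
```

The design point the abstract emphasizes shows up in steps 2 and 4: execution, not human annotation, supplies both the ground-truth target and the verifiable reward, which is what removes the dependence on curated question-answer data.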
