Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR work in the zero setting avoids supervision in labeling the reasoning process, but still depends on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves its reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR uses a code executor both to validate proposed code reasoning tasks and to verify answers, so that the executor serves as a unified source of verifiable reward that guides open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Finally, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
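
The abstract describes a code executor that plays two roles: validating a proposed code reasoning task and verifying a submitted answer. The following is a minimal Python sketch of that idea only, not AZR's actual implementation. The function names (`run_program`, `validate_task`, `verify_answer`), the convention that a task is a program defining a function `f` plus one input, and the binary 0/1 reward are all assumptions made for illustration. The proposer and solver models are left out, and the task and answers are hard-coded.

```python
from typing import Any, Tuple


def run_program(src: str, arg: Any, entry: str = "f") -> Any:
    """Execute `src`, then call the function named `entry` on `arg`.

    Hypothetical helper. It uses plain exec() with no sandboxing, so it
    must never be given untrusted code outside of a toy setting.
    """
    namespace: dict = {}
    exec(src, namespace)
    return namespace[entry](arg)


def validate_task(src: str, arg: Any) -> Tuple[bool, Any]:
    """Treat a proposed (program, input) pair as valid if it runs without error.

    Returns (True, reference_output) on success and (False, None) otherwise.
    """
    try:
        return True, run_program(src, arg)
    except Exception:
        return False, None


def verify_answer(reference: Any, answer: Any) -> float:
    """Binary verifiable reward: 1.0 if the answer matches the executor's output."""
    return 1.0 if answer == reference else 0.0


# A hard-coded stand-in for a task the model might propose.
proposed_program = "def f(x):\n    return sorted(x)[::-1]"
proposed_input = [3, 1, 2]

is_valid, reference = validate_task(proposed_program, proposed_input)

# Hard-coded stand-ins for answers the model might give to
# "what does f return on this input?".
reward_correct = verify_answer(reference, [3, 2, 1])
reward_wrong = verify_answer(reference, [1, 2, 3])

print(is_valid, reference, reward_correct, reward_wrong)
```

In this sketch the same executor call produces both the validity signal for the proposed task and the reference output used to score an answer, which is one way to read "a unified source of verifiable reward". A real system would need isolation, timeouts, and a reward for the proposing role, none of which are specified in the abstract.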
