SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, because models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and an 8.4% improvement on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
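The abstract names role-conditioned advantage estimation (RAE) without defining it. As a rough illustration only, the sketch below assumes that RAE keeps a separate running baseline of returns for each (game, role) pair and subtracts it from the observed return. The class name `RoleBaselines`, the `decay` parameter, and the choice to update the baseline after computing the advantage are all hypothetical and are not taken from the abstract.

```python
from collections import defaultdict


class RoleBaselines:
    """Hypothetical sketch: one running return baseline per (game, role)."""

    def __init__(self, decay=0.95):
        # decay is an assumed smoothing factor for the moving average
        self.decay = decay
        # every baseline starts at 0.0 until returns are observed
        self.values = defaultdict(float)

    def advantage(self, game, role, ret):
        """Return ret minus the current baseline, then update the baseline."""
        key = (game, role)
        adv = ret - self.values[key]
        # exponential moving average of the returns seen for this role
        self.values[key] = self.decay * self.values[key] + (1 - self.decay) * ret
        return adv


# Zero-sum episode: role 0 wins (+1), role 1 loses (-1).
baselines = RoleBaselines(decay=0.5)
first_win = baselines.advantage("kuhn_poker", 0, 1.0)
first_loss = baselines.advantage("kuhn_poker", 1, -1.0)
second_win = baselines.advantage("kuhn_poker", 0, 1.0)
```

Under this assumption, a role that wins more often than its opponent (for example because of a first-mover edge) gets a smaller advantage for each additional win, which is one way a per-role baseline could reduce variance in multi-agent training.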
