
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

June 30, 2025
作者: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
cs.AI

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
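
The abstract does not spell out how role-conditioned advantage estimation (RAE) works; the following is a minimal sketch of one plausible reading, assuming RAE amounts to maintaining a separate running return baseline per game role so that advantages for each side of a zero-sum game are computed against that role's own statistics. The class name `RoleConditionedAdvantage`, the EMA baseline, and the `decay` parameter are illustrative assumptions, not details confirmed by the paper.

```python
from collections import defaultdict


class RoleConditionedAdvantage:
    """Sketch: keep one exponential-moving-average return baseline per role
    and compute advantages relative to that role's own baseline, so the two
    sides of a zero-sum game do not share a single, mismatched baseline."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baselines = defaultdict(float)  # role -> EMA of episode returns
        self.seen = set()

    def advantage(self, role: str, episode_return: float) -> float:
        # Initialize the baseline from the first return observed for this role.
        if role not in self.seen:
            self.baselines[role] = episode_return
            self.seen.add(role)
        # Advantage is measured against the role-specific baseline, then the
        # baseline is updated toward the new return.
        adv = episode_return - self.baselines[role]
        self.baselines[role] = (
            self.decay * self.baselines[role]
            + (1.0 - self.decay) * episode_return
        )
        return adv


# Example usage: zero-sum returns from one self-play game (e.g., Kuhn Poker).
rae = RoleConditionedAdvantage()
print(rae.advantage("player_0", +1.0))  # winner, relative to player_0's baseline
print(rae.advantage("player_1", -1.0))  # loser, relative to player_1's baseline
```

The design choice being illustrated is only the role-specific baseline itself; how the actual system plugs these advantages into its online multi-turn policy-gradient updates is not described in the abstract.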