
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

June 30, 2025
作者: Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
cs.AI

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
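
The abstract does not spell out how role-conditioned advantage estimation (RAE) works; the following is a minimal sketch of one plausible reading, assuming RAE amounts to maintaining a separate running return baseline per game role so that advantages for each side of a zero-sum game are computed against that role's own statistics. The class name `RoleConditionedAdvantage`, the EMA baseline, and the `decay` parameter are illustrative assumptions, not details confirmed by the paper.

```python
from collections import defaultdict


class RoleConditionedAdvantage:
    """Sketch: keep one exponential-moving-average return baseline per role
    and compute advantages relative to that role's own baseline, so the two
    sides of a zero-sum game do not share a single, mismatched baseline."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.baselines = defaultdict(float)  # role -> EMA of episode returns
        self.seen = set()

    def advantage(self, role: str, episode_return: float) -> float:
        # Initialize the baseline from the first return observed for this role.
        if role not in self.seen:
            self.baselines[role] = episode_return
            self.seen.add(role)
        # Advantage is measured against the role-specific baseline, then the
        # baseline is updated toward the new return.
        adv = episode_return - self.baselines[role]
        self.baselines[role] = (
            self.decay * self.baselines[role]
            + (1.0 - self.decay) * episode_return
        )
        return adv


# Example usage: zero-sum returns from one self-play game (e.g., Kuhn Poker).
rae = RoleConditionedAdvantage()
print(rae.advantage("player_0", +1.0))  # winner, relative to player_0's baseline
print(rae.advantage("player_1", -1.0))  # loser, relative to player_1's baseline
```

The design choice being illustrated is only the role-specific baseline itself; how the actual system plugs these advantages into its online multi-turn policy-gradient updates is not described in the abstract.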