SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, because models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone yields an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
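The abstract names role-conditioned advantage estimation (RAE) without spelling out its form. As a rough illustration only, one plausible reading is that each (game, role) pair keeps its own running baseline of returns, so that an inherent first-player or second-player advantage is subtracted before the policy update. The sketch below assumes an exponential-moving-average baseline; the class name `RoleBaseline`, the `decay` parameter, and its default value are hypothetical and are not taken from the paper's released code.

```python
from collections import defaultdict


class RoleBaseline:
    """Hypothetical sketch: one running return baseline per (game, role) pair."""

    def __init__(self, decay: float = 0.95):
        # Assumed EMA decay; the value is illustrative, not from the source.
        self.decay = decay
        # Every baseline starts at 0.0, the mean return of a balanced zero-sum game.
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        """Update the baseline for (game, role) and return the centered return."""
        key = (game, role)
        updated = self.decay * self.baselines[key] + (1.0 - self.decay) * episode_return
        self.baselines[key] = updated
        # The advantage is the episode return minus the role-specific baseline.
        return episode_return - updated


if __name__ == "__main__":
    rae = RoleBaseline(decay=0.5)
    # In a zero-sum game the two roles receive opposite returns.
    print(rae.advantage("kuhn_poker", role=0, episode_return=1.0))
    print(rae.advantage("kuhn_poker", role=1, episode_return=-1.0))
```

Keeping the baselines separate per role means a role that wins more often on average does not receive systematically larger gradient signals than its opponent, which is one way to read the stabilizing effect the abstract attributes to RAE.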
