SPICE: Self-Play In Corpus Environments Improves Reasoning

Abstract

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework in which a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods, which offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis shows that document grounding is a key ingredient that lets SPICE continuously generate and achieve its own increasingly challenging goals, enabling sustained self-improvement.
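
The two-role loop described above can be illustrated with a toy sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the reward functions, so the Challenger reward below (highest when the Reasoner succeeds about half the time) is a hypothetical stand-in for "tasks at the frontier of the Reasoner's capability". The names `reasoner_reward`, `challenger_reward`, and `self_play_step` are illustrative, and the model calls are replaced by plain Python callables.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical type aliases: in a real system both roles would be the same model.
Challenger = Callable[[str], Tuple[str, str]]  # document -> (question, gold answer)
Reasoner = Callable[[str], str]                # question -> predicted answer


def reasoner_reward(prediction: str, gold: str) -> float:
    """Return 1.0 if the prediction matches the document-derived answer, else 0.0."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def challenger_reward(pass_rate: float) -> float:
    """Assumed frontier reward: 1.0 at a 50% pass rate, 0.0 at 0% or 100%."""
    return 1.0 - abs(2.0 * pass_rate - 1.0)


def self_play_step(
    corpus: List[str],
    challenger: Challenger,
    reasoner: Reasoner,
    num_attempts: int = 8,
    rng: Optional[random.Random] = None,
) -> Tuple[float, List[float]]:
    """Run one Challenger/Reasoner round and return both roles' rewards."""
    rng = rng or random.Random(0)
    document = rng.choice(corpus)            # corpus grounding: sample a source document
    question, gold = challenger(document)    # Challenger writes a task from the document
    # Assumption: the Reasoner sees only the question, not the document.
    scores = [reasoner_reward(reasoner(question), gold) for _ in range(num_attempts)]
    pass_rate = sum(scores) / num_attempts
    return challenger_reward(pass_rate), scores
```

In a reinforcement learning setup, the two returned reward signals would feed policy updates for the shared model. A task the Reasoner always solves or never solves earns the Challenger nothing under this assumed reward, which is one simple way to push generated tasks toward intermediate difficulty.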
