Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Abstract

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
