Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Abstract

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
