SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, which leaves open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge: it writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7B to 8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data, which is trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis, with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
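
The loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify reward formulas, so `rubric_score` (fraction of rubric items satisfied) and `frontier_reward` (a score that peaks when the Solver succeeds about half the time) are assumptions chosen only to make the roles concrete. The callables `challenger`, `solver`, `write_rubric`, and `grade` are hypothetical stand-ins for model calls, and the multi-turn retrieval is hidden inside `solver`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    question: str   # open-ended task produced by the Challenger
    document: str   # source document the task is grounded in


def rubric_score(verdicts: List[bool]) -> float:
    """Assumed Solver reward: fraction of rubric items judged satisfied."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def frontier_reward(solver_scores: List[float]) -> float:
    """Assumed Challenger reward: 1.0 when the mean Solver score is 0.5,
    falling linearly to 0.0 when tasks are trivially easy or impossible."""
    if not solver_scores:
        return 0.0
    mean = sum(solver_scores) / len(solver_scores)
    return 1.0 - 2.0 * abs(mean - 0.5)


def self_play_step(
    document: str,
    challenger: Callable[[str], str],          # document -> task question
    solver: Callable[[str], str],              # question -> response (retrieval inside)
    write_rubric: Callable[[Task], List[str]],  # frozen judge: task -> rubric items
    grade: Callable[[str, str], bool],          # frozen judge: (item, response) -> verdict
    n_samples: int = 4,
) -> Tuple[List[float], float]:
    """Run one Challenger/Solver/judge round and return both reward signals."""
    task = Task(question=challenger(document), document=document)
    rubric = write_rubric(task)  # rubric is written from the source document
    scores = []
    for _ in range(n_samples):
        response = solver(task.question)
        verdicts = [grade(item, response) for item in rubric]
        scores.append(rubric_score(verdicts))
    return scores, frontier_reward(scores)
```

In a real training run, the per-sample scores would feed the Solver's policy update and the second value would feed the Challenger's; the optimizer and the exact reward shaping are left out here because the abstract does not describe them.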