SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

May 29, 2026
Authors: Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini
cs.AI

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
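
To make the three roles concrete, here is a minimal sketch of one self-play round in Python. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not give the reward formulas, so `rubric_score` (the fraction of rubric criteria satisfied) and `frontier_reward` (a Challenger reward that peaks when the Solver's mean score is 0.5) are hypothetical stand-ins. The callables `challenger`, `solver`, `write_rubric`, and `satisfied` are also hypothetical placeholders for model calls, and multi-turn retrieval is assumed to be hidden inside `solver`.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple


@dataclass
class Task:
    question: str
    document: str  # source document the task is grounded in


def rubric_score(rubric: List[str],
                 satisfied: Callable[[str, str], bool],
                 response: str) -> float:
    """Fraction of rubric criteria the response satisfies (assumed scoring rule)."""
    if not rubric:
        return 0.0
    return sum(satisfied(criterion, response) for criterion in rubric) / len(rubric)


def frontier_reward(solver_scores: List[float]) -> float:
    """Hypothetical Challenger reward: 1.0 when the mean Solver score is 0.5,
    falling linearly to 0.0 when the task is always solved or never solved."""
    return 1.0 - 2.0 * abs(mean(solver_scores) - 0.5)


def self_play_round(document: str,
                    challenger: Callable[[str], str],
                    solver: Callable[[str], str],
                    write_rubric: Callable[[Task], List[str]],
                    satisfied: Callable[[str, str], bool],
                    n_samples: int = 4) -> Tuple[List[float], float]:
    """Run one round and return (per-response Solver rewards, Challenger reward)."""
    # Challenger proposes a task grounded in the document.
    task = Task(question=challenger(document), document=document)
    # Frozen self-judge writes a task-specific rubric from the source document.
    rubric = write_rubric(task)
    # Solver sees only the question; retrieval is assumed to happen inside it.
    responses = [solver(task.question) for _ in range(n_samples)]
    # Frozen self-judge grades each response against the rubric.
    solver_rewards = [rubric_score(rubric, satisfied, r) for r in responses]
    return solver_rewards, frontier_reward(solver_rewards)
```

In a real training loop both reward signals would feed a policy-gradient update for the respective policy, while the judge stays frozen. The sketch leaves that step out because the abstract does not describe it.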