
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

May 29, 2026
Authors: Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini
cs.AI

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
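To make the three roles in the abstract concrete, here is a minimal sketch of one self-play step. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the reward definitions, so `frontier_reward` (which peaks when the Solver's mean score is 0.5) is a hypothetical stand-in for "keeping tasks near the Solver's frontier", and `rubric_score` replaces the frozen self-judge with a toy substring check. All names (`Task`, `self_play_step`, `challenger`, `solver`, `write_rubric`) are invented for this example, and multi-turn retrieval and the policy updates are omitted.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple


@dataclass
class Task:
    document: str  # source document the task is grounded in
    question: str  # open-ended question written by the Challenger


def rubric_score(response: str, rubric: List[str]) -> float:
    """Fraction of rubric items found in the response.

    ASSUMPTION: a case-insensitive substring check stands in for the
    frozen self-judge, which in the paper is a copy of the initial model.
    """
    if not rubric:
        return 0.0
    text = response.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    return hits / len(rubric)


def frontier_reward(solver_scores: List[float]) -> float:
    """ASSUMPTION: hypothetical Challenger reward, highest when the
    Solver's mean score is 0.5 and zero when tasks are trivial or impossible."""
    return 1.0 - abs(2.0 * mean(solver_scores) - 1.0)


def self_play_step(
    document: str,
    challenger: Callable[[str], str],
    solver: Callable[[Task, int], str],
    write_rubric: Callable[[Task], List[str]],
    n_samples: int = 4,
) -> Tuple[Task, List[float], float]:
    """Run one Challenger -> Solver -> judge round and return both rewards."""
    task = Task(document, challenger(document))
    rubric = write_rubric(task)  # judge writes a task-specific rubric
    responses = [solver(task, i) for i in range(n_samples)]
    solver_rewards = [rubric_score(r, rubric) for r in responses]
    return task, solver_rewards, frontier_reward(solver_rewards)


if __name__ == "__main__":
    # Stub policies so the sketch runs without any model.
    doc = "The Nile flows north through Egypt into the Mediterranean."
    canned = [
        "It flows north.",
        "It flows north through Egypt.",
        "North, through Egypt, into the Mediterranean.",
        "No idea.",
    ]
    task, solver_rewards, challenger_reward = self_play_step(
        doc,
        challenger=lambda d: "Describe the course of the river.",
        solver=lambda t, i: canned[i],
        write_rubric=lambda t: ["north", "Egypt", "Mediterranean"],
    )
    print(solver_rewards, challenger_reward)
```

In a real training loop the per-response scores would feed a policy-gradient update for the Solver and the frontier-style reward an update for the Challenger; both updates are outside the scope of this sketch.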