SCOPE: Self-Play via Policy Co-Evolution for Open-Ended Tasks

Abstract

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge: it writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis, with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
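
To make the data flow between the three roles concrete, the following is a minimal sketch of a single self-play round, not the SCOPE implementation. Every name here (`Task`, `Rubric`, `self_play_step`, and the `toy_*` functions) is hypothetical, the roles are stubbed with string heuristics instead of language models, and the "fraction of rubric criteria met" score is an assumption chosen for illustration. Multi-turn retrieval and the policy updates for the Challenger and Solver are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    question: str
    document: str  # source document the task is grounded in


@dataclass
class Rubric:
    criteria: List[str]


def self_play_step(
    document: str,
    challenger: Callable[[str], str],
    solver: Callable[[str], str],
    write_rubric: Callable[[Task], Rubric],
    grade: Callable[[Rubric, str], float],
) -> float:
    """One round: Challenger proposes a task, the frozen judge writes a
    rubric from the source document, and the Solver's answer is graded."""
    task = Task(question=challenger(document), document=document)
    rubric = write_rubric(task)      # judge has access to the document
    answer = solver(task.question)   # Solver sees only the question
    return grade(rubric, answer)


# Toy stand-ins for the three roles (assumptions, for illustration only).
def toy_challenger(document: str) -> str:
    return "Summarize the key facts about: " + document.split()[0]


def toy_solver(question: str) -> str:
    return "Paris is the capital of France"


def toy_write_rubric(task: Task) -> Rubric:
    # One criterion per sentence of the source document.
    sentences = [s.strip() for s in task.document.split(".") if s.strip()]
    return Rubric(criteria=sentences)


def toy_grade(rubric: Rubric, answer: str) -> float:
    # Assumed scoring rule: fraction of criteria found verbatim in the answer.
    if not rubric.criteria:
        return 0.0
    met = sum(1 for c in rubric.criteria if c.lower() in answer.lower())
    return met / len(rubric.criteria)
```

In a real system each callable would wrap a model: the Challenger and Solver would be the two trainable policies, while `write_rubric` and `grade` would both call the same frozen copy of the initial model. The returned score would then feed whatever reward the training procedure defines.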