DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

Abstract

Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, in which a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver, which lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
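As a rough illustration of the asymmetry described in the abstract, the sketch below shows only the data flow: the same model answers with the document in its prompt (teacher), and that answer becomes the target for a prompt without the document (student). The names `make_student_example` and `toy_generate` and the prompt templates are hypothetical, introduced here for illustration; they are not taken from the DARC code, which may build prompts and filter pseudo-labels differently.

```python
from typing import Callable, Dict

Generate = Callable[[str], str]  # hypothetical text-in, text-out model call


def make_student_example(generate: Generate, document: str, question: str) -> Dict[str, str]:
    # Teacher pass: same model, but the prompt includes the source document.
    teacher_prompt = f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"
    pseudo_label = generate(teacher_prompt).strip()
    # Student pass: the training prompt omits the document entirely.
    student_prompt = f"Question: {question}\nAnswer:"
    return {"prompt": student_prompt, "target": pseudo_label}


def toy_generate(prompt: str) -> str:
    # Stand-in for an LLM (assumption): answers only if the fact is in the prompt.
    return " 1889 " if "completed in 1889" in prompt else " unknown "


example = make_student_example(
    toy_generate,
    document="The Eiffel Tower was completed in 1889.",
    question="When was the Eiffel Tower completed?",
)
```

In this toy setup the student prompt carries no document text, so any training signal has to come from the teacher's document-grounded answer stored in `target`.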
