DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution
January 20, 2026
Authors: Shengda Fan, Xuyan Ye, Yankai Lin
cs.AI
Abstract
Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, in which a document-augmented teacher generates high-quality pseudo-labels to supervise a student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
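The two-stage pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the data flow only: the function names, the string-based toy "models", and the difficulty encoding are all stand-ins invented for this sketch, not the authors' implementation. The key asymmetry is that the teacher labels questions while conditioned on the source document, whereas the student Solver is trained on those pseudo-labels without ever seeing the document.

```python
# Hypothetical sketch of DARC's two-stage data flow (not the authors' code).
# Stage 1: a Questioner synthesizes questions conditioned on an explicit
#          difficulty level and a document from an external corpus.
# Stage 2: a document-augmented teacher produces pseudo-labels that
#          supervise a student Solver lacking document access.

def questioner(doc: str, difficulty: int) -> str:
    """Toy Questioner: emits a question tagged with its target difficulty."""
    return f"[difficulty={difficulty}] What does the passage claim about: {doc[:40]}?"

def teacher_answer(question: str, doc: str) -> str:
    """Toy asymmetric teacher: answers WITH access to the source document,
    so pseudo-labels are grounded rather than bootstrapped from the Solver."""
    return f"grounded answer derived from: {doc[:30]}"

def build_distillation_set(corpus: list[str], difficulties: list[int]) -> list[dict]:
    """Assemble (question, pseudo-label) pairs; the student Solver is later
    trained on these pairs alone, without the documents."""
    dataset = []
    for doc in corpus:
        for level in difficulties:
            question = questioner(doc, level)
            pseudo_label = teacher_answer(question, doc)  # document-augmented labeling
            dataset.append({"question": question, "pseudo_label": pseudo_label})
    return dataset

if __name__ == "__main__":
    corpus = ["Photosynthesis converts light energy into chemical energy."]
    data = build_distillation_set(corpus, difficulties=[1, 2, 3])
    print(len(data))  # one training pair per (document, difficulty level)
```

Decoupling in this sketch is structural: the Questioner's output depends only on the document and the requested difficulty level, never on the Solver's current accuracy, which removes the non-stationary reward loop the abstract identifies.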