

DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

January 20, 2026
Authors: Shengda Fan, Xuyan Ye, Yankai Lin
cs.AI

Abstract

Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
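To make the two-stage recipe concrete, below is a minimal Python sketch of the pipeline the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the interfaces (questioner.generate, teacher.answer, student.finetune) are hypothetical placeholders; see the linked repository for the actual code.

```python
# Hypothetical sketch of the DARC two-stage curriculum described in the abstract.
# Stage 1: train the Questioner to produce difficulty-calibrated questions.
# Stage 2: asymmetric self-distillation, where a document-augmented teacher
#          labels questions and a document-free student is trained on them.

from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    answer: str  # pseudo-label produced by the teacher


def stage1_train_questioner(questioner, corpus: List[str], difficulty_levels: List[int]) -> None:
    """Stage 1: condition the Questioner on an explicit difficulty level and an
    external document, so generated questions are calibrated to that difficulty."""
    for doc in corpus:
        for level in difficulty_levels:
            prompt = f"[difficulty={level}]\n{doc}"
            question = questioner.generate(prompt)   # placeholder generation call
            questioner.update(prompt, question)      # placeholder training update


def stage2_asymmetric_distillation(questioner, teacher, student,
                                   corpus: List[str], difficulty_levels: List[int]) -> None:
    """Stage 2: the teacher answers each question WITH the source document
    (document-augmented), producing pseudo-labels; the student Solver is then
    supervised on (question, pseudo-label) pairs WITHOUT document access."""
    dataset: List[Example] = []
    for doc in corpus:
        for level in difficulty_levels:
            question = questioner.generate(f"[difficulty={level}]\n{doc}")
            pseudo_label = teacher.answer(question=question, context=doc)  # teacher sees the document
            dataset.append(Example(question=question, answer=pseudo_label))
    student.finetune(dataset)  # student never sees the documents, only Q/A pairs
```

The asymmetry is the key design choice this sketch tries to show: because the teacher's answers are grounded in the external document, the pseudo-labels are higher quality than what the student could bootstrap on its own, which is how the framework avoids the self-labeling error loop.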