Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability
January 26, 2026
Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
cs.AI
Abstract
Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR, a self-improvement framework that surfaces these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy and is rewarded according to the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 initial success) reveals three core findings. First, we show that bi-level meta-RL can unlock learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform the intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity-collapse modes those schemes typically exhibit. Third, analysis of the generated questions reveals that structural quality and well-posedness matter more for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require a preexisting ability to solve the hard problems themselves, paving a principled path to escaping reasoning plateaus without additional curated data.