

Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

January 26, 2026
Authors: Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe
cs.AI

Abstract

Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR, a self-improvement framework that surfaces such pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy and is rewarded according to the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 initial success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform the intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity-collapse modes those schemes typically exhibit. Third, analysis of the generated questions reveals that structural quality and well-posedness matter more for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require a preexisting ability to solve the hard problems themselves, paving a principled path toward escaping reasoning plateaus without additional curated data.
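To make the loop described in the abstract concrete, here is a minimal Python sketch of one grounded bi-level update, written under stated assumptions: every name (success_rate, propose_problem, rl_update, the question/answer fields) is a hypothetical stand-in for whatever training stack is used; the abstract specifies only the structure of the loop, not this API.

```python
# Illustrative sketch only: the model interface below is assumed, not the
# authors' implementation. It mirrors the abstract's key point that the
# teacher's reward is the measured change in student success on hard problems.
import copy
import random


def success_rate(model, problems, attempts=8):
    # Fraction of held-out hard problems solved at least once in `attempts` tries
    # (binary success, matching the sparse reward setting described in the paper).
    solved = 0
    for p in problems:
        if any(model.solve(p.question) == p.answer for _ in range(attempts)):
            solved += 1
    return solved / max(len(problems), 1)


def soar_step(teacher, student, hard_problems, n_synthetic=32, inner_steps=4):
    # One outer (meta-RL) step of the teacher-student loop.
    baseline = success_rate(student, hard_problems)

    # Teacher proposes synthetic "stepping stone" problems for the student.
    curriculum = [teacher.propose_problem() for _ in range(n_synthetic)]

    # Inner loop: the student copy runs RL on the synthetic curriculum.
    trained = copy.deepcopy(student)
    for _ in range(inner_steps):
        batch = random.sample(curriculum, k=min(8, len(curriculum)))
        trained.rl_update(batch)

    # Grounded reward: measured improvement on the real hard subset,
    # rather than an intrinsic proxy such as novelty or self-rated difficulty.
    improvement = success_rate(trained, hard_problems) - baseline
    teacher.rl_update(curriculum, reward=improvement)
    return trained, improvement
```

The design point the sketch captures is the grounding: the teacher is scored by how much the student actually improves on the hard subset, which is what the abstract contrasts with the intrinsic proxy rewards used in prior LLM self-play.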