Skill-Targeted Adaptive Training
October 11, 2025
Authors: Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora
cs.AI
Abstract
Language models often show little to no improvement (i.e., "saturation") when
trained via vanilla supervised fine-tuning (SFT) on data similar to what they
saw in their training set (e.g., MATH). We introduce a new fine-tuning
strategy, STAT, to train such a student model by using the metacognition
ability of a stronger large language model (LLM) as the teacher. The teacher
uses the task dataset to create a list of skills needed for the task, and then
labels each data point with its required skills (Didolkar et al., 2024). By
monitoring the student's answers, the teacher creates a Missing-Skill-Profile
for the student, tracking how often the student fails to apply each skill in its
responses. We use this idea to build a modified training set in one of two
ways. In STAT-Sel, the teacher uses an existing set of training examples but
adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn,
the teacher synthesizes additional examples involving missing skills. Across
extensive experiments on Llama and Qwen models, our methods yield improvements
of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore,
STAT improves performance on out-of-distribution benchmarks (e.g., AIME24/25 and
AMC23) by an average of 4.6%. Crucially, we find that STAT is
complementary to RL via GRPO (Shao et al., 2024): after the model is improved
using STAT to address skill gaps, GRPO continues to add further gains. We
conclude that skill-targeted adaptive training should broadly improve current
training pipelines. Our code is available at:
https://github.com/princeton-pli/STAT.
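
The reweighting idea behind STAT-Sel can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that); it only assumes that the teacher has already produced, for each incorrect student response, a list of skills it judged missing, and that each training example carries skill labels. The function names, the `floor` parameter, and the additive weighting rule are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def missing_skill_profile(failures):
    """Normalize how often each skill was judged missing.

    `failures` is a list with one entry per incorrect student response;
    each entry is the list of skill names the teacher marked as missing.
    (Assumed input format, not taken from the paper.)
    """
    counts = Counter(skill for skills in failures for skill in skills)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {skill: n / total for skill, n in counts.items()}


def example_weights(train_skills, profile, floor=0.05):
    """Weight each training example by the profile mass of its labeled skills.

    `floor` (hypothetical) keeps examples with no missing skills
    from being dropped entirely.
    """
    return [
        floor + sum(profile.get(skill, 0.0) for skill in skills)
        for skills in train_skills
    ]


def resample(examples, weights, k, seed=0):
    """Draw k training examples with replacement, proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=k)


# Toy data: two wrong answers, three labeled training examples.
failures = [["fractions", "factoring"], ["fractions"]]
train_examples = ["ex_a", "ex_b", "ex_c"]
train_skills = [["fractions"], ["geometry"], ["factoring", "fractions"]]

profile = missing_skill_profile(failures)
weights = example_weights(train_skills, profile)
adapted_set = resample(train_examples, weights, k=10)
```

In this toy setup the example labeled only with `geometry` keeps just the floor weight, while examples touching `fractions` or `factoring` are sampled far more often. A STAT-Syn-style variant would instead pass the highest-weighted skills to the teacher as prompts for generating new questions.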