Skill-Targeted Adaptive Training

October 11, 2025
Authors: Yinghui He, Abhishek Panigrahi, Yong Lin, Sanjeev Arora
cs.AI

Abstract

Language models often show little to no improvement (i.e., "saturation") when trained via vanilla supervised fine-tuning (SFT) on data similar to what they saw in their training set (e.g., MATH). We introduce a new fine-tuning strategy, STAT, to train such a student model by using the metacognitive ability of a stronger large language model (LLM) as the teacher. The teacher uses the task dataset to create a list of skills needed for the task, and then labels each data point with its required skills (Didolkar et al., 2024). By monitoring the student's answers, the teacher creates a Missing-Skill-Profile for the student, tracking how often it failed to apply each skill in its responses. We use this idea to build a modified training set in one of two ways. In STAT-Sel, the teacher uses an existing set of training examples but adaptively reweights them according to the Missing-Skill-Profile. In STAT-Syn, the teacher synthesizes additional examples involving missing skills. Across extensive experiments on Llama and Qwen models, our methods yield improvements of up to 7.5% on MATH, whereas SFT provides only limited gains. Furthermore, STAT enhances performance on out-of-distribution benchmarks (e.g., AIME24/25, AMC23, etc.) by an average of 4.6%. Crucially, we find that STAT is complementary to RL via GRPO (Shao et al., 2024): after the model is improved using STAT to address skill gaps, GRPO continues to add further gains. We conclude that skill-targeted adaptive training should broadly improve current training pipelines. Our code is available at: https://github.com/princeton-pli/STAT.
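To make the Missing-Skill-Profile and STAT-Sel reweighting concrete, here is a minimal, hypothetical sketch of the idea described in the abstract. It is not the authors' implementation (see the linked repository for that); the data format, the skill names, and the helpers `example_weight` and `student_failed_skills` are placeholders chosen for illustration, assuming the teacher has already labeled each training example with its required skills and judged which skills the student failed to apply.

```python
import random
from collections import Counter

# Illustrative sketch only: placeholder data, not the STAT codebase.
# Each training example is annotated by the teacher with the skills it requires.
examples = [
    {"question": "...", "answer": "...", "skills": ["algebraic_manipulation", "casework"]},
    {"question": "...", "answer": "...", "skills": ["modular_arithmetic"]},
    {"question": "...", "answer": "...", "skills": ["casework", "counting"]},
]

# Skills the teacher judged the student failed to apply, aggregated over its answers.
student_failed_skills = ["casework", "casework", "modular_arithmetic"]

# Missing-Skill-Profile: how often each skill was missing from the student's responses.
missing_profile = Counter(student_failed_skills)
total_misses = sum(missing_profile.values())

def example_weight(example, profile, total, smoothing=1e-3):
    """Weight an example by how strongly it exercises the student's missing skills."""
    score = sum(profile.get(skill, 0) / total for skill in example["skills"])
    return score + smoothing  # smoothing keeps every example sampleable

weights = [example_weight(ex, missing_profile, total_misses) for ex in examples]

# STAT-Sel-style selection (sketched): sample a reweighted fine-tuning set that
# emphasizes examples covering the skills the student most often misses.
reweighted_set = random.choices(examples, weights=weights, k=2)
for ex in reweighted_set:
    print(ex["skills"])
```

STAT-Syn would instead prompt the teacher to generate new examples targeting the highest-frequency entries of the profile, rather than resampling the existing set.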