SkillFactory: Self-Distillation For Learning Cognitive Behaviors
December 3, 2025
Authors: Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett
cs.AI
Abstract
Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, further training with reinforcement learning (RL) can teach the model to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from the SkillFactory SFT initialization helps a model generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
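To make the "rearranging" idea concrete, here is a minimal sketch of how self-generated samples could be composed into a "silver" SFT trace that exhibits verification and backtracking. This is an assumed illustration based only on the abstract, not the authors' implementation; all names (`Sample`, `build_silver_trace`, the trace wording) are hypothetical.

```python
# Hypothetical sketch: rearrange a model's own samples into a "silver" SFT trace
# showing a failed attempt, an explicit verification step, a backtrack, and a retry.
import random
from dataclasses import dataclass

@dataclass
class Sample:
    reasoning: str   # model-generated chain of thought for the question
    answer: str      # final answer extracted from that chain
    correct: bool    # whether the answer matches the reference

def build_silver_trace(question: str, samples: list[Sample]) -> str | None:
    """Compose one SFT trace from self-sampled attempts (hypothetical format)."""
    wrong = [s for s in samples if not s.correct]
    right = [s for s in samples if s.correct]
    if not wrong or not right:
        return None  # need both a failed and a successful attempt to show backtracking
    first, retry = random.choice(wrong), random.choice(right)
    return (
        f"Question: {question}\n"
        f"{first.reasoning}\n"
        f"So the answer might be {first.answer}.\n"
        "Wait, let me verify that answer.\n"
        "The check fails, so that attempt was wrong. Let me try a different approach.\n"
        f"{retry.reasoning}\n"
        f"The answer is {retry.answer}."
    )
```

Under this reading, no stronger teacher model is needed: the trace text comes entirely from the model's own samples, with only the connective "verify/backtrack" scaffolding imposed by the data construction.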