SkillFactory: Self-Distillation For Learning Cognitive Behaviors

December 3, 2025
作者: Zayne Sprague, Jack Lu, Manya Wadhwa, Sedrick Keh, Mengye Ren, Greg Durrett
cs.AI

Abstract

Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, further training with reinforcement learning (RL) can teach it to leverage them. How can we get models to leverage skills that base models do not exhibit? Our method, SkillFactory, fine-tunes models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model; instead, it rearranges samples from the model itself to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but they are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from the SkillFactory SFT initialization helps a model generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) the model indeed uses the cognitive skills; and (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
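To make the self-distillation step concrete, here is a minimal sketch (not the authors' released code) of how such "silver" traces could be assembled, assuming hypothetical `generate` and `is_correct` helpers: sample several solutions from the base model itself, then splice a failed attempt and a successful attempt into a single trace that exhibits a verify-backtrack-retry pattern.

```python
# A minimal, hypothetical sketch of SkillFactory-style trace construction.
# `generate` and `is_correct` are stand-ins for the base model's sampler and
# an answer checker; the splice template is illustrative, not the paper's.

import random
from typing import Callable, Optional


def make_silver_trace(
    prompt: str,
    generate: Callable[[str], str],     # samples one solution from the base model
    is_correct: Callable[[str], bool],  # checks a sample against the gold answer
    n_samples: int = 8,
) -> Optional[str]:
    """Rearrange the model's own samples into a backtrack-then-retry trace."""
    samples = [generate(prompt) for _ in range(n_samples)]
    wrong = [s for s in samples if not is_correct(s)]
    right = [s for s in samples if is_correct(s)]
    if not wrong or not right:
        return None  # need both a failed and a successful attempt to splice
    # Splice: a failed attempt, an explicit verification/backtracking cue,
    # then a successful attempt, yielding one "silver" SFT target.
    return (
        random.choice(wrong)
        + "\nWait, let me check this answer... it does not hold up. "
        + "Let me retry with a different approach.\n"
        + random.choice(right)
    )
```

Traces built this way are imperfect by construction, which is why the abstract calls them "silver" data: their role is only to prime skill use before RL, not to serve as gold supervision.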