Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
April 8, 2026
Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu
cs.AI
Abstract
A prevailing narrative in LLM post-training holds that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.