Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
April 8, 2026
Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu
cs.AI
Abstract
A prevailing narrative in LLM post-training holds that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.