Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
April 8, 2026
Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu
cs.AI
Abstract
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.