Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
April 8, 2026
Authors: Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu
cs.AI
Abstract
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.