Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Abstract

A prevailing narrative in LLM post-training holds that supervised fine-tuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades, then recovers and improves with extended training (a dip-and-recovery pattern), so checkpoints from short training runs can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones merely imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades. The question should therefore be reframed from whether reasoning SFT generalizes to under what conditions and at what cost it does so.
