Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
February 11, 2026
Authors: Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
cs.AI
Abstract
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
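The stopping rule described above — scale epochs and halt once training token accuracy saturates — can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the authors' implementation; the function name, the saturation threshold, and the plateau parameters (`min_delta`, `patience`) are all hypothetical choices.

```python
def should_stop(token_acc_history, saturation=0.99, min_delta=1e-3, patience=2):
    """Decide whether to stop adding epochs, per the abstract's criterion.

    token_acc_history: per-epoch training token accuracies in [0, 1]
    (fraction of next-token predictions matching targets under teacher
    forcing). Stops when accuracy reaches `saturation` (near-full
    memorization) or has improved by less than `min_delta` for
    `patience` consecutive epochs. All thresholds are illustrative.
    """
    if not token_acc_history:
        return False
    # Full (or near-full) memorization reached.
    if token_acc_history[-1] >= saturation:
        return True
    # Not enough history yet to judge a plateau.
    if len(token_acc_history) <= patience:
        return False
    # Plateau: every recent epoch-over-epoch gain is below min_delta.
    recent = token_acc_history[-(patience + 1):]
    gains = [b - a for a, b in zip(recent, recent[1:])]
    return all(g < min_delta for g in gains)


# Example: accuracy climbs, then flattens short of the threshold.
history = [0.62, 0.78, 0.90, 0.955, 0.9555, 0.9558]
print(should_stop(history))  # plateaued gains -> True
```

In practice such a check would run after each epoch over the small, repeated dataset, replacing a fixed epoch count (e.g. 128) with a data-dependent stopping point.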