

Through the Valley: Path to Effective Long CoT Training for Small Language Models

June 9, 2025
Authors: Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu
cs.AI

Abstract

Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
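To make the error-accumulation intuition concrete, here is a minimal back-of-the-envelope sketch, not the paper's actual analysis: assuming each reasoning step fails independently with some per-step error rate p (an illustrative assumption), the probability that a k-step chain is entirely correct decays exponentially as (1 - p)^k.

```python
# Back-of-the-envelope illustration of error accumulation in long CoT.
# Assumption (not from the paper): each reasoning step fails independently
# with probability p, so a k-step chain is fully correct with probability
# (1 - p) ** k. The values of p and k below are hypothetical.

def chain_success_prob(p_step_error: float, num_steps: int) -> float:
    """Probability that all num_steps reasoning steps are correct."""
    return (1.0 - p_step_error) ** num_steps

if __name__ == "__main__":
    p = 0.05  # hypothetical 5% per-step error rate
    for k in (5, 20, 50, 100):
        print(f"{k:>3} steps -> {chain_success_prob(p, k):.1%} fully correct")
```

Under this toy model, even a modest 5% per-step error rate leaves only about a 0.6% chance of a fully correct 100-step chain, which is one way to see why longer responses can disproportionately hurt small models with higher per-step error rates.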