

Through the Valley: Path to Effective Long CoT Training for Small Language Models

June 9, 2025
Authors: Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu
cs.AI

Abstract

Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; ≤3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
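
To make the error-accumulation argument concrete (an illustrative back-of-envelope calculation, not a figure reported in the paper): if each reasoning step is correct independently with probability p, a chain of n steps is fully correct with probability p^n. At p = 0.95, a 10-step solution is fully correct about 60% of the time (0.95^10 ≈ 0.60), while a 50-step solution is fully correct only about 8% of the time (0.95^50 ≈ 0.08), so longer chains sharply magnify the cost of even small per-step error rates.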