Through the Valley: Path to Effective Long CoT Training for Small Language Models

Abstract

Long chain-of-thought (CoT) supervision has become a common strategy for enhancing reasoning in language models. While it is effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data suffer significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3, and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of the performance they had before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their pre-fine-tuning performance. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
