LLMs Can Easily Learn to Reason from Demonstrations. Structure, not content, is what matters!

Abstract

Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements needed to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves accuracy only 3.2% lower than a model trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper for our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.
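To make the distinction between structural and content perturbations concrete, the following is a minimal sketch and not the authors' implementation. It assumes that reasoning steps are separated by blank lines and that the reasoning keywords are the short hypothetical list shown; the abstract specifies neither. The function names are illustrative only.

```python
import random
import re

# Hypothetical keyword list; the abstract does not enumerate the keywords used.
REASONING_KEYWORDS = ("wait", "alternatively", "hmm", "let me double-check")


def split_steps(cot: str) -> list[str]:
    """Split a chain of thought into steps on blank lines (an assumed delimiter)."""
    return [s.strip() for s in cot.split("\n\n") if s.strip()]


def shuffle_steps(steps: list[str], fraction: float, rng: random.Random) -> list[str]:
    """Structural perturbation: permute the steps at a random subset of positions."""
    k = round(len(steps) * fraction)
    positions = rng.sample(range(len(steps)), k)
    targets = positions[:]
    rng.shuffle(targets)
    out = list(steps)
    for src, dst in zip(positions, targets):
        out[dst] = steps[src]
    return out


def delete_steps(steps: list[str], fraction: float, rng: random.Random) -> list[str]:
    """Structural perturbation: drop a random subset of steps, keeping the rest in order."""
    k = round(len(steps) * fraction)
    drop = set(rng.sample(range(len(steps)), k))
    return [s for i, s in enumerate(steps) if i not in drop]


def remove_keywords(steps: list[str], keywords=REASONING_KEYWORDS) -> list[str]:
    """Content perturbation: strip reasoning keywords while leaving step order intact."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keywords)) + r")\b[,.]?\s*",
        re.IGNORECASE,
    )
    return [pattern.sub("", s).strip() for s in steps]


if __name__ == "__main__":
    rng = random.Random(0)
    steps = split_steps("Step A.\n\nWait, step B.\n\nStep C.\n\nStep D.")
    print(shuffle_steps(steps, 1.0, rng))   # same steps, order disrupted
    print(delete_steps(steps, 0.5, rng))    # half of the steps removed
    print(remove_keywords(steps))           # order kept, keywords stripped
```

In this sketch, the first two functions change which steps follow which, and this is the kind of modification the abstract reports as significantly degrading accuracy. The third changes only the wording inside each step, which the abstract reports as having little effect.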
