Demystifying Long Chain-of-Thought Reasoning in LLMs

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.

