Demystifying Long Chain-of-Thought Reasoning in LLMs
February 5, 2025
Authors: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
cs.AI
Abstract
Scaling inference compute enhances reasoning in large language models (LLMs),
with long chains-of-thought (CoTs) enabling strategies like backtracking and
error correction. Reinforcement learning (RL) has emerged as a crucial method
for developing these capabilities, yet the conditions under which long CoTs
emerge remain unclear, and RL training requires careful design choices. In this
study, we systematically investigate the mechanics of long CoT reasoning,
identifying the key factors that enable models to generate long CoT
trajectories. Through extensive supervised fine-tuning (SFT) and RL
experiments, we present four main findings: (1) While SFT is not strictly
necessary, it simplifies training and improves efficiency; (2) Reasoning
capabilities tend to emerge with increased training compute, but their
development is not guaranteed, making reward shaping crucial for stabilizing
CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We
find that leveraging noisy, web-extracted solutions with filtering mechanisms
shows strong potential, particularly for out-of-distribution (OOD) tasks such
as STEM reasoning; and (4) Core abilities like error correction are inherently
present in base models, but incentivizing these skills effectively for complex
tasks via RL demands significant compute, and measuring their emergence
requires a nuanced approach. These insights provide practical guidance for
optimizing training strategies to enhance long CoT reasoning in LLMs. Our code
is available at: https://github.com/eddycmu/demystify-long-cot.
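The second finding highlights reward shaping as the lever that stabilizes CoT length growth during RL. As a rough illustration only, the sketch below shows one way a length-aware shaped reward could look; the function name `shaped_reward`, the cosine interpolation, and all constants are illustrative assumptions, not the paper's actual reward design.

```python
import math

def shaped_reward(is_correct: bool, cot_len: int, max_len: int = 4096) -> float:
    """Illustrative length-aware reward shaping (a sketch, not the paper's exact scheme).

    Correct answers earn slightly more when the chain-of-thought is short;
    incorrect answers are penalized less when the chain-of-thought is long,
    nudging the policy to keep reasoning rather than terminate early while
    still discouraging unbounded length growth.
    """
    if cot_len >= max_len:          # hard cap: truncated CoTs receive a flat penalty
        return -1.0
    start, end = (1.0, 0.5) if is_correct else (-1.0, -0.5)
    t = cot_len / max_len           # normalized CoT length in [0, 1)
    # cosine interpolation from `start` (at length 0) to `end` (at max_len)
    return end + (start - end) * (1.0 + math.cos(math.pi * t)) / 2.0
```

In a scheme like this, the shaped scalar would simply replace the raw 0/1 verifier outcome as the per-trajectory reward fed to the RL algorithm; the exact endpoints and cap are tuning choices.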