How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
May 30, 2025
作者: Hongyi James Cai, Junlin Wang, Xiaoyin Chen, Bhuwan Dhingra
cs.AI
Abstract
Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems with verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how much it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used as an SFT warm-up make a moderate contribution to RL training compared with cold-start RL; however, this contribution diminishes as tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets that vary systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of correctness (content) from that of structure (i.e., backtrack frequency). We find that (1) longer CoT sequences with backtracking generally induce better and more stable RL training, and (2) more challenging problems with larger search spaces tend to require more backtracking steps during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
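
As a concrete illustration of the controlled setup described above, the sketch below shows one plausible way to synthesize CoT traces whose backtracking frequency is varied while the final solution stays correct, so that only structure changes between SFT datasets. The task, trace wording, and function names are illustrative assumptions, not the authors' actual data-generation code.

# Minimal sketch (assumed format, not the paper's pipeline): build a CoT trace
# for a Countdown-style problem and inject a controlled number of dead-end +
# backtrack segments before the correct steps.
import random

def make_trace(problem: str, solution_steps: list[str], n_backtracks: int) -> str:
    """Interleave n_backtracks deliberate dead ends into a direct solution,
    so backtrack frequency varies while the final answer stays correct."""
    lines = [f"Problem: {problem}"]
    # Pick which solution steps get preceded by a failed attempt.
    k = min(n_backtracks, len(solution_steps))
    positions = set(random.sample(range(len(solution_steps)), k=k))
    for i, step in enumerate(solution_steps):
        if i in positions:
            lines.append("Try a different branch here... this path does not reach the target.")
            lines.append("Backtrack: revert to the previous state and try another operation.")
        lines.append(step)
    lines.append("Answer found.")
    return "\n".join(lines)

if __name__ == "__main__":
    steps = ["Compute 25 * 4 = 100.", "Compute 100 - 36 = 64.", "Compute 64 + 11 = 75."]
    for k in (0, 2):
        print(f"--- trace with {k} backtracks ---")
        print(make_trace("Reach 75 from {25, 4, 36, 11}", steps, k))

Under this kind of setup, one SFT corpus per backtrack count (e.g., 0, 2, 4 backtracks per trace) can be generated and each used to warm-start the same RL run, which is the sense in which the experiments isolate structure from content.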