How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
May 30, 2025
作者: Hongyi James Cai, Junlin Wang, Xiaoyin Chen, Bhuwan Dhingra
cs.AI
Abstract
Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems with verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how much it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used as an SFT warm-up make a moderate contribution to RL training compared with cold-start RL; however, this contribution diminishes as tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets that vary systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of correctness (content) from that of structure (i.e., backtrack frequency). We find that (1) longer CoT sequences with backtracking generally induce better and more stable RL training, and (2) more challenging problems with larger search spaces tend to require more backtracking steps during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
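
As a concrete illustration of the controlled setup described above, the sketch below shows one plausible way to synthesize CoT traces whose backtracking frequency is varied while the final solution stays correct, so that only structure changes between SFT datasets. The task, trace wording, and function names are illustrative assumptions, not the authors' actual data-generation code.

# Minimal sketch (assumed format, not the paper's pipeline): build a CoT trace
# for a Countdown-style problem and inject a controlled number of dead-end +
# backtrack segments before the correct steps.
import random

def make_trace(problem: str, solution_steps: list[str], n_backtracks: int) -> str:
    """Interleave n_backtracks deliberate dead ends into a direct solution,
    so backtrack frequency varies while the final answer stays correct."""
    lines = [f"Problem: {problem}"]
    # Pick which solution steps get preceded by a failed attempt.
    k = min(n_backtracks, len(solution_steps))
    positions = set(random.sample(range(len(solution_steps)), k=k))
    for i, step in enumerate(solution_steps):
        if i in positions:
            lines.append("Try a different branch here... this path does not reach the target.")
            lines.append("Backtrack: revert to the previous state and try another operation.")
        lines.append(step)
    lines.append("Answer found.")
    return "\n".join(lines)

if __name__ == "__main__":
    steps = ["Compute 25 * 4 = 100.", "Compute 100 - 36 = 64.", "Compute 64 + 11 = 75."]
    for k in (0, 2):
        print(f"--- trace with {k} backtracks ---")
        print(make_trace("Reach 75 from {25, 4, 36, 11}", steps, k))

Under this kind of setup, one SFT corpus per backtrack count (e.g., 0, 2, 4 backtracks per trace) can be generated and each used to warm-start the same RL run, which is the sense in which the experiments isolate structure from content.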