How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

May 30, 2025
Authors: Hongyi James Cai, Junlin Wang, Xiaoyin Chen, Bhuwan Dhingra
cs.AI

Abstract

Recent breakthroughs in large language models (LLMs) have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically how much it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used in SFT as a warm-up do contribute moderately to RL training compared with cold-start RL; however, this contribution diminishes as tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets that vary systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of either correctness (content) or structure (i.e., backtrack frequency). We find that (1) longer CoT sequences with backtracking generally induce better and more stable RL training, and (2) more challenging problems with larger search spaces tend to require more backtracking steps during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.
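To make the abstract's data-construction idea concrete, the sketch below is a hypothetical Python example, not the authors' released code, of how one might generate Countdown-style CoT traces whose backtracking frequency is controlled independently of content correctness. The function name `synthetic_cot`, the trace wording, and the specific task instance are illustrative assumptions.

```python
import random
from operator import add, mul, sub

OPS = {"+": add, "-": sub, "*": mul}

def synthetic_cot(numbers, target, n_backtracks, seed=0):
    """Build a synthetic chain-of-thought trace for a Countdown-style problem
    with a controlled number of explicit backtracking steps.

    Only the structure (how often the trace backtracks) is controlled; the
    intermediate arithmetic need not actually solve the instance, mirroring
    the structure-vs-content ablation described in the abstract.
    """
    rng = random.Random(seed)
    lines = [f"Goal: reach {target} using the numbers {numbers}."]
    for _ in range(n_backtracks):
        a, b = rng.sample(numbers, 2)   # pick a random pair of operands
        sym = rng.choice(list(OPS))     # pick a random operator symbol
        result = OPS[sym](a, b)
        lines.append(f"Try {a} {sym} {b} = {result}.")
        lines.append(f"{result} is not {target}; backtrack and try a different combination.")
    lines.append(f"After exploring alternatives, commit to a final expression that evaluates to {target}.")
    lines.append(f"Answer: {target}")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hold the task instance fixed and vary only the number of backtracks,
    # yielding SFT traces that differ in structure but not in the problem.
    for k in (0, 2, 5):
        print(f"--- {k} backtracking steps ---")
        print(synthetic_cot([3, 7, 25, 50, 75, 100], target=100, n_backtracks=k, seed=42))
        print()
```

Sweeping `n_backtracks` in this way yields SFT warm-up datasets that isolate backtrack frequency as the experimental variable, in the spirit of the controlled experiments the paper describes.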