

Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

June 9, 2025
作者: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, Wentao Zhang
cs.AI

Abstract

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
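The abstract only describes ReLIFT's interleaving of RL and online fine-tuning at a high level. The snippet below is a minimal Python sketch of that training loop under stated assumptions; every name in it (`relift_train`, `is_too_hard`, `rl_update`, `sft_update`, `collect_solution`, the `model.generate`/`model.verify` interface, `sft_interval`) is a hypothetical placeholder for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the ReLIFT loop described in the abstract:
# train mainly with RL, but buffer questions the model cannot solve and
# periodically fine-tune on collected high-quality solutions.
# All names are illustrative placeholders.

def is_too_hard(model, question, n_samples=8):
    """Treat a question as 'hard' if none of the sampled rollouts verify as correct."""
    rollouts = [model.generate(question) for _ in range(n_samples)]
    return not any(model.verify(question, r) for r in rollouts)

def relift_train(model, questions, collect_solution, rl_update, sft_update,
                 sft_interval=100):
    hard_buffer = []  # (question, high-quality demonstration) pairs for fine-tuning
    for step, question in enumerate(questions, start=1):
        if is_too_hard(model, question):
            # RL yields no useful learning signal here; collect a demonstration instead.
            hard_buffer.append((question, collect_solution(question)))
        else:
            # Questions within the model's current reach are trained with RL.
            rl_update(model, question)
        # Periodically interleave an SFT phase on the accumulated hard questions.
        if step % sft_interval == 0 and hard_buffer:
            sft_update(model, hard_buffer)
            hard_buffer.clear()
    return model
```

The design point this sketch tries to capture, as stated in the abstract, is that fine-tuning is reserved for questions beyond the model's current abilities, so RL remains the primary driver on everything it can already solve.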