

Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

June 9, 2025
Authors: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Bin Cui, Wentao Zhang
cs.AI

Abstract

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the model's current scope. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained with RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and training alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
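For a concrete picture of the interleaving the abstract describes, below is a minimal Python sketch of the training loop. It is an illustrative reading of the abstract only, not the authors' reference implementation: the function names (`rl_update`, `sft_update`, `solve_rate`, `collect_solution`), the hard-question criterion (solve rate at or below a threshold under rollouts), and the buffer handling are all assumptions.

```python
# Minimal sketch of a ReLIFT-style loop: RL as the main driver, with online
# fine-tuning on high-quality solutions collected for the hardest questions.
# All placeholders below are hypothetical stand-ins, not the paper's code.

import random


def rl_update(model, questions):
    """Placeholder for one RL step (e.g., policy optimization on rollouts)."""
    return model


def sft_update(model, demonstrations):
    """Placeholder for one supervised fine-tuning step on collected solutions."""
    return model


def solve_rate(model, question, num_rollouts=8):
    """Placeholder: fraction of rollouts that reach a verified correct answer."""
    return random.random()


def collect_solution(question):
    """Placeholder: obtain a high-quality demonstration for a hard question."""
    return {"question": question, "solution": "<high-quality solution>"}


def relift(model, question_pool, num_rounds=10, batch_size=32, hard_threshold=0.1):
    hard_buffer = []  # demonstrations for questions RL alone fails to crack
    for _ in range(num_rounds):
        batch = random.sample(question_pool, batch_size)

        # RL phase: improve on questions within the model's current reach.
        model = rl_update(model, batch)

        # Flag the hardest questions (assumed criterion: near-zero solve rate)
        # and gather high-quality solutions for them.
        for q in batch:
            if solve_rate(model, q) <= hard_threshold:
                hard_buffer.append(collect_solution(q))

        # Fine-tuning phase: inject new knowledge and reasoning patterns via SFT.
        if hard_buffer:
            model = sft_update(model, hard_buffer)
            hard_buffer.clear()
    return model
```

The intent of this structure, as described in the abstract, is that RL maintains and sharpens what the model can already do, while SFT is reserved for the questions where RL makes no progress, which is why only a small fraction of demonstration data is needed.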