S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Abstract

Recent studies have demonstrated the effectiveness of test-time scaling for LLMs. However, existing approaches to incentivizing LLMs' deep thinking abilities generally require large-scale data or significant training effort, and it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. These skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimal resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B improves in accuracy from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at https://github.com/NineAbyss/S2R.
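As a rough illustration of the difference between outcome-level and process-level reward signals, the sketch below scores a single self-verify and self-correct trajectory. It is not taken from the S^2R code: the `Step` structure, the function names, the exact-match check against a reference answer, and the +1/-1 reward values are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One attempt in a trajectory (hypothetical structure, not from the paper)."""
    answer: str             # answer proposed at this attempt
    verified_correct: bool  # the model's own verdict on that answer


def outcome_reward(steps: List[Step], gold: str) -> float:
    """Outcome-level signal: a single reward based only on the final answer.

    Assumption: +1 if the last answer matches the reference, otherwise -1.
    """
    return 1.0 if steps[-1].answer == gold else -1.0


def process_rewards(steps: List[Step], gold: str) -> List[float]:
    """Process-level signal: one reward per self-verification step.

    Assumption: a verdict earns +1 when it agrees with the actual
    correctness of the answer it judges, otherwise -1.
    """
    return [
        1.0 if (step.answer == gold) == step.verified_correct else -1.0
        for step in steps
    ]


# The model first answers "12", judges it wrong, then corrects to "15".
trajectory = [Step("12", False), Step("15", True)]
print(outcome_reward(trajectory, gold="15"))
print(process_rewards(trajectory, gold="15"))
```

In this toy trajectory both self-verification verdicts are accurate and the final answer is correct, so every reward is positive. A trajectory that ends on a wrong answer it has wrongly accepted would be penalized by both signals.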

