Self-rewarding correction for mathematical reasoning
February 26, 2025
Authors: Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
cs.AI
Abstract
We study self-rewarding reasoning large language models (LLMs), which can
simultaneously generate step-by-step reasoning and evaluate the correctness of
their outputs at inference time, without external feedback. This
integrated approach allows a single model to independently guide its reasoning
process, offering computational advantages for model deployment. We
particularly focus on the representative task of self-correction, where models
autonomously detect errors in their responses, revise outputs, and decide when
to terminate iterative refinement loops. To enable this, we propose a
two-stage algorithmic framework for constructing self-rewarding reasoning
models using only self-generated data. In the first stage, we employ sequential
rejection sampling to synthesize long chain-of-thought trajectories that
incorporate both self-rewarding and self-correction mechanisms. Fine-tuning
models on these curated data allows them to learn the patterns of
self-rewarding and self-correction. In the second stage, we further enhance the
models' ability to assess response accuracy and refine outputs through
reinforcement learning with rule-based signals. Experiments with Llama-3 and
Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction
capabilities and achieves performance comparable to systems that rely on
external reward models.
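The first stage can be pictured as follows. This is a minimal sketch of sequential rejection sampling for trajectory synthesis; `sample_solution` and `check_answer` are hypothetical stand-ins for the base model's sampler and a ground-truth answer checker, and the attempt/verdict/revision layout is an assumed trajectory format, not the authors' exact data schema.

```python
# Hedged sketch of stage-1 data synthesis via sequential rejection sampling.
# `sample_solution` and `check_answer` are hypothetical stand-ins; the
# (attempt, verdict, revision) trajectory format is an assumption.

def synthesize_trajectory(problem, answer, sample_solution, check_answer,
                          max_tries=16):
    """Build one long chain-of-thought trajectory: a first attempt, a
    self-evaluation verdict, and (if the attempt was wrong) a verified
    correction, each segment found by rejection sampling."""
    # Step 1: sample a first attempt and record whether it is correct.
    attempt = sample_solution(problem)
    correct = check_answer(attempt, answer)
    trajectory = [("attempt", attempt),
                  ("verdict", "correct" if correct else "incorrect")]
    if correct:
        return trajectory
    # Step 2: rejection-sample revisions until one passes the checker,
    # so the trajectory also demonstrates a successful self-correction.
    for _ in range(max_tries):
        revision = sample_solution(problem, previous=attempt)
        if check_answer(revision, answer):
            trajectory += [("revision", revision), ("verdict", "correct")]
            return trajectory
    return None  # discard problems with no verified correction found
```

Fine-tuning on trajectories like these is what teaches the model the self-rewarding and self-correction patterns mentioned in the abstract.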
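At deployment, the trained model drives an iterative refinement loop of the kind the abstract describes: generate, self-evaluate, revise, and decide when to stop. The sketch below assumes hypothetical `generate` and `self_evaluate` callables standing in for two prompts to the same fine-tuned model; neither name comes from the paper.

```python
# Hypothetical sketch of the self-rewarding self-correction loop.
# `generate` and `self_evaluate` stand in for calls to the same fine-tuned
# model; both names are assumptions, not the authors' API.

def self_rewarding_inference(problem, generate, self_evaluate, max_rounds=3):
    """Generate an answer, let the model judge it, and revise until the
    model's own verdict is 'correct' or the round budget is exhausted."""
    attempt = generate(problem)  # step-by-step reasoning plus final answer
    for _ in range(max_rounds):
        verdict = self_evaluate(problem, attempt)  # model-internal reward
        if verdict == "correct":  # the model itself decides to terminate
            break
        attempt = generate(problem, previous=attempt)  # revise the attempt
    return attempt
```

Because the same model plays both roles, no external reward model is queried at test time, which is the computational advantage the abstract highlights.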