S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning

Abstract

Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S^2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimal resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT (Chain-of-Thought) distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S^2R. Our code and data are available at https://github.com/NineAbyss/S2R.
