RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Abstract

With the rapid advancement of Large Language Models (LLMs), developing effective critic modules that provide precise guidance has become crucial yet challenging. In this paper, we first demonstrate that supervised fine-tuning, which is widely adopted in current solutions for building critic modules, fails to genuinely enhance models' critique abilities: it produces superficial critiques with insufficient reflection and verification. To unlock unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracy of the policy model based on critiques. The aim is to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. In the critique-and-refinement setting, RefCritic demonstrates consistent advantages across all benchmarks, e.g., gains of 6.8% and 7.2% on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling as the number of votes increases. Moreover, despite being trained with solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark for identifying erroneous steps in mathematical reasoning.
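The two rule-based rewards and the critic-filtered majority vote described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the additive combination with a hypothetical `refine_weight`, exact string matching of answers, and the fallback to unfiltered voting when the critic rejects every sample are all assumptions made for the example.

```python
from collections import Counter
from typing import Sequence


def dual_reward(
    critic_verdict: bool,
    solution_is_correct: bool,
    refined_answers: Sequence[str],
    reference_answer: str,
    refine_weight: float = 1.0,  # hypothetical weight; the abstract gives no value
) -> float:
    """Combine the two rule-based signals named in the abstract.

    (1) Judgment reward: 1 if the critic's correct/incorrect verdict matches
        the ground-truth label of the solution, else 0.
    (2) Refinement reward: fraction of policy-model refinements (produced
        after reading the critique) that reach the reference answer.
    Adding the two terms is an assumption of this sketch.
    """
    judgment_reward = 1.0 if critic_verdict == solution_is_correct else 0.0
    if refined_answers:
        hits = sum(answer == reference_answer for answer in refined_answers)
        refinement_reward = hits / len(refined_answers)
    else:
        refinement_reward = 0.0
    return judgment_reward + refine_weight * refinement_reward


def filtered_majority_vote(answers: Sequence[str], accepted: Sequence[bool]) -> str:
    """Majority vote over only the sampled answers the critic accepted.

    Falls back to voting over all answers if the critic rejected every
    sample (an assumption of this sketch).
    """
    kept = [answer for answer, ok in zip(answers, accepted) if ok]
    pool = kept if kept else list(answers)
    return Counter(pool).most_common(1)[0][0]
```

For instance, with sampled answers `["7", "5", "5", "7", "7"]` and critic verdicts `[False, True, True, True, False]`, plain majority voting would pick `"7"`, while the filtered vote picks `"5"` because two of the three accepted samples agree on it.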
