RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback
July 20, 2025
Authors: Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun, Junyang Lin
cs.AI
Abstract
With the rapid advancement of Large Language Models (LLMs), developing
effective critic modules for precise guidance has become crucial yet
challenging. In this paper, we first demonstrate that supervised
fine-tuning for building critic modules (the approach widely adopted in current
solutions) fails to genuinely enhance models' critique abilities, producing
superficial critiques with insufficient reflection and verification. To
unlock unprecedented critique capabilities, we propose RefCritic, a
long-chain-of-thought critic module based on reinforcement learning with dual
rule-based rewards: (1) instance-level correctness of solution judgments and
(2) refinement accuracy of the policy model based on critiques, aiming to
generate high-quality evaluations with actionable feedback that effectively
guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and
DeepSeek-R1-Distill-Qwen-14B across five benchmarks. In critique-and-refinement
settings, RefCritic demonstrates consistent advantages across all benchmarks,
e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably,
under majority voting, policy-model solutions filtered by RefCritic scale
better as the number of votes increases. Moreover, despite being trained with
solution-level supervision only, RefCritic outperforms step-level supervised
approaches on ProcessBench, a benchmark for identifying erroneous steps in
mathematical reasoning.
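
For a concrete picture of the training signal, the sketch below illustrates one way the two rule-based rewards and the critique-filtered majority vote described in the abstract could be computed. It is a minimal sketch, not the paper's implementation: all identifiers, the equal-weight combination, and the exact-match answer check are assumptions.

```python
# Hypothetical sketch only: the abstract names two rule-based reward signals but
# gives no formulas or code. All identifiers below (CritiqueSample, judgment_reward,
# refinement_reward, dual_reward, critique_filtered_vote), the equal weighting, and
# the exact-match answer check are assumptions, not details from the paper.
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class CritiqueSample:
    predicted_verdict: bool      # critic's judgment: is the candidate solution correct?
    gold_verdict: bool           # ground-truth correctness label for that solution
    refined_answers: List[str]   # policy answers produced after refining with the critique
    gold_answer: str             # reference final answer


def judgment_reward(s: CritiqueSample) -> float:
    """Signal (1): instance-level correctness of the critic's solution judgment."""
    return 1.0 if s.predicted_verdict == s.gold_verdict else 0.0


def refinement_reward(s: CritiqueSample) -> float:
    """Signal (2): how often critique-guided refinement reaches the gold answer."""
    if not s.refined_answers:
        return 0.0
    hits = sum(ans == s.gold_answer for ans in s.refined_answers)
    return hits / len(s.refined_answers)


def dual_reward(s: CritiqueSample, alpha: float = 0.5) -> float:
    """Combine both signals into one scalar RL reward (alpha is an assumed weight)."""
    return alpha * judgment_reward(s) + (1.0 - alpha) * refinement_reward(s)


def critique_filtered_vote(answers: List[str], keep: List[bool]) -> str:
    """Majority vote over the policy answers that the critic judged correct."""
    kept = [a for a, k in zip(answers, keep) if k] or answers  # fall back if all rejected
    return Counter(kept).most_common(1)[0][0]
```

In actual training, scalar rewards of this kind would drive reinforcement learning over long chain-of-thought critiques; the exact-match check and 0.5 weighting here are placeholders for whatever rules the paper uses.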