RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Abstract

With the rapid advancement of Large Language Models (LLMs), developing effective critic modules that provide precise guidance has become crucial yet challenging. In this paper, we first show that supervised fine-tuning, the approach widely adopted in current solutions for building critic modules, fails to genuinely enhance models' critique abilities: it produces superficial critiques with insufficient reflection and verification. To unlock unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracy of the policy model based on the critiques. The aim is to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. In critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., gains of 6.8% and 7.2% on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling as the number of votes increases. Moreover, despite being trained with solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark for identifying erroneous steps in mathematical reasoning.
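The dual reward and the critic-filtered majority vote described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the additive combination, the `refine_weight` parameter, and the fallback when the critic rejects every sample are all assumptions made for illustration, since the abstract does not specify how the two signals are combined or how filtering is applied.

```python
from collections import Counter
from typing import Sequence


def dual_reward(
    predicted_correct: bool,
    actually_correct: bool,
    refined_answers: Sequence[str],
    reference_answer: str,
    refine_weight: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """Toy combination of the two rule-based signals named in the abstract."""
    # (1) Instance-level judgment: did the critic label the solution correctly?
    judgment_reward = 1.0 if predicted_correct == actually_correct else 0.0

    # (2) Refinement accuracy: fraction of policy-model refinements, written
    # after reading the critique, that match the reference answer.
    if refined_answers:
        hits = sum(ans == reference_answer for ans in refined_answers)
        refinement_reward = hits / len(refined_answers)
    else:
        refinement_reward = 0.0

    # Assumption: the two signals are simply added together.
    return judgment_reward + refine_weight * refinement_reward


def critic_filtered_vote(answers: Sequence[str], critic_accepts: Sequence[bool]) -> str:
    """Majority vote over the sampled answers that the critic judged correct."""
    kept = [a for a, ok in zip(answers, critic_accepts) if ok]
    # Assumption: fall back to all samples if the critic rejects everything.
    pool = kept or list(answers)
    return Counter(pool).most_common(1)[0][0]
```

In this sketch, a critic that judges a solution correctly and whose critique leads half of the refinements to the reference answer receives a reward of 1.5 under the default weight. The vote function shows how filtering can change the outcome: an answer that is a minority among all samples can still win if the critic rejects most of the competing samples.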

