RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Abstract


With the rapid advancement of large language models (LLMs), developing effective critic modules that provide precise guidance has become crucial yet challenging. In this paper, we first demonstrate that supervised fine-tuning, the approach widely adopted in current solutions for building critic modules, fails to genuinely enhance models' critique abilities and produces superficial critiques with insufficient reflection and verification. To unlock unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module trained with reinforcement learning using dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracy of the policy model when it revises its solution based on the critique. The goal is to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. In critique-and-refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., gains of 6.8% and 7.2% on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic scale better as the number of votes increases. Moreover, despite being trained only with solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark for identifying erroneous steps in mathematical reasoning.
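To make the two reward signals and the critic-filtered voting concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not say how the two rewards are combined, so the additive form, the weight `lam`, the exact-match answer check, and the fallback when the critic rejects every sample are all assumptions; the function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def dual_reward(
    predicted_correct: bool,
    solution_is_correct: bool,
    refined_answers: Sequence[str],
    reference_answer: str,
    lam: float = 1.0,
) -> float:
    """Hypothetical combination of the two rule-based rewards.

    judgment reward:   1.0 if the critic's verdict matches the label, else 0.0
    refinement reward: fraction of the policy model's refinements (produced
                       after reading the critique) that match the reference
    The additive form and the weight `lam` are assumptions.
    """
    judgment_reward = 1.0 if predicted_correct == solution_is_correct else 0.0
    if refined_answers:
        hits = sum(answer == reference_answer for answer in refined_answers)
        refinement_reward = hits / len(refined_answers)
    else:
        refinement_reward = 0.0
    return judgment_reward + lam * refinement_reward


def filtered_majority_vote(answers: Sequence[str], accepted: Sequence[bool]) -> str:
    """Majority vote over the answers the critic accepted.

    Falls back to voting over all answers when the critic accepted none
    (an assumed fallback, not taken from the abstract).
    """
    kept = [answer for answer, ok in zip(answers, accepted) if ok]
    pool = kept if kept else list(answers)
    return Counter(pool).most_common(1)[0][0]
```

In this sketch a critique is rewarded both for judging the solution correctly and for leading the policy model to a correct revision, so a verdict that is right but unhelpful earns less than one that also enables a successful refinement.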
