DeepCritic: Deliberate Critique with Large Language Models
May 1, 2025
Authors: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
cs.AI
Abstract
As Large Language Models (LLMs) rapidly evolve, providing accurate feedback and scalable oversight on their outputs has become an urgent and critical
problem. Leveraging LLMs as critique models to achieve automated supervision is
a promising solution. In this work, we focus on studying and enhancing the math
critique ability of LLMs. Current LLM critics provide critiques of each step that are too shallow and superficial, leading to low judgment accuracy and failing to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework for developing LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize
Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for
supervised fine-tuning. Each seed critique consists of deliberate step-wise
critiques that include multi-perspective verifications as well as in-depth
critiques of initial critiques for each reasoning step. Then, we perform
reinforcement learning on the fine-tuned model with either existing
human-labeled data from PRM800K or our automatically annotated data obtained
via Monte Carlo sampling-based correctness estimation, to further incentivize
its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct
not only significantly outperforms existing LLM critics (including the
same-sized DeepSeek-R1-distill models and GPT-4o) on various error
identification benchmarks, but also more effectively helps the LLM generator
refine erroneous steps through more detailed feedback.
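
To make the seed-data construction concrete, below is a minimal sketch of how one long-form seed critique might be structured, with an initial critique, multi-perspective verifications, and a critique of the initial critique for each reasoning step. The field names and example contents are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical record layout for one long-form seed critique; the field
# names are illustrative assumptions, not taken from the paper.
seed_critique = {
    "problem": "Solve for x: 2x + 3 = 11",
    "solution_steps": [
        "Step 1: Subtract 3 from both sides to get 2x = 8.",
        "Step 2: Divide both sides by 2 to get x = 4.",
    ],
    "step_critiques": [
        {
            # First-pass judgment of the step.
            "initial_critique": "Subtracting 3 from both sides is valid.",
            # Multi-perspective re-checks of the same step.
            "verifications": [
                "Re-derive: 2x + 3 - 3 = 11 - 3, so 2x = 8.",
                "Substitute back: with 2x = 8, 2x + 3 = 11 holds.",
            ],
            # In-depth critique of the initial critique itself.
            "critique_of_critique": "The initial critique verified the "
                                    "algebra; it should also confirm the "
                                    "step starts from the original "
                                    "equation, which it does.",
            "final_judgment": "correct",
        },
        # ... one entry per remaining reasoning step
    ],
}
```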
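The Monte Carlo sampling-based correctness estimation used to auto-annotate the RL data can be read as follows: for each solution prefix, sample several completions and treat the fraction that reach the reference answer as a soft correctness signal for the last step of the prefix. Below is a minimal sketch under that reading, not the paper's exact procedure; `sample_completion` is a hypothetical callable standing in for a call to the LLM generator.

```python
def estimate_step_correctness(problem, steps, reference_answer,
                              sample_completion, n_samples=8):
    """Monte Carlo estimate of per-step correctness labels.

    For each prefix steps[:i+1], draw n_samples completions and count
    how many reach the reference answer; the success rate serves as a
    soft correctness label for step i (near 0.0 = likely erroneous).
    """
    labels = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        hits = sum(
            1
            for _ in range(n_samples)
            if sample_completion(problem, prefix) == reference_answer
        )
        labels.append(hits / n_samples)
    return labels
```

A step whose success rate drops sharply relative to the preceding step is then a natural candidate for an erroneous-step annotation.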