
DeepCritic: Deliberate Critique with Large Language Models

May 1, 2025
作者: Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
cs.AI

Abstract

As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy, and they struggle to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
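The Monte Carlo sampling-based correctness estimation mentioned above can be sketched roughly as follows: for each prefix of a step-wise solution, sample several completions from a generator and score the step by how often those completions reach the known gold answer. This is a minimal illustrative sketch, not the paper's implementation; the `sample_completions` callable, the `num_samples` default, and the exact prompt format are all assumptions.

```python
from typing import Callable, List


def mc_step_correctness(
    problem: str,
    steps: List[str],
    gold_answer: str,
    # Hypothetical LLM interface: given a prompt prefix and a sample
    # count, return that many final answers extracted from completions.
    sample_completions: Callable[[str, int], List[str]],
    num_samples: int = 8,
) -> List[float]:
    """Estimate per-step correctness of a math solution.

    For each prefix problem + steps[:i], sample completions and record
    the fraction that arrive at the gold answer. A low score after step
    i (relative to step i-1) suggests step i introduced an error.
    """
    scores: List[float] = []
    for i in range(1, len(steps) + 1):
        prefix = problem + "\n" + "\n".join(steps[:i])
        answers = sample_completions(prefix, num_samples)
        correct = sum(1 for a in answers if a == gold_answer)
        scores.append(correct / num_samples)
    return scores
```

In practice the per-step scores would be thresholded into binary correct/incorrect labels to serve as RL supervision in place of human annotations such as PRM800K.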

