DeepCritic: Deliberate Critique with Large Language Models

Abstract

As large language models (LLMs) rapidly evolve, providing accurate feedback and scalable oversight on their outputs has become an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial at each step, which leads to low judgment accuracy and makes it hard to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework for developing LLM critics capable of deliberately critiquing each reasoning step of math solutions. In the first stage, we use Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of the initial critiques for each reasoning step. In the second stage, we perform reinforcement learning on the fine-tuned model, using either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our critique model, built on Qwen2.5-7B-Instruct, not only significantly outperforms existing LLM critics (including same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
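
The abstract does not spell out how Monte Carlo sampling-based correctness estimation is computed. The sketch below shows one common reading, offered as an assumption and not as the authors' implementation: for each prefix of a solution, sample several completions and record the fraction that reach the reference answer, then treat the first step whose estimate falls to the threshold as the first erroneous step. The names `estimate_step_correctness`, `first_error_step`, and `toy_rollout`, as well as the `num_samples` and `threshold` parameters, are hypothetical; `toy_rollout` stands in for sampling from an LLM generator.

```python
import random
from typing import Callable, List, Optional


def estimate_step_correctness(
    problem: str,
    steps: List[str],
    rollout: Callable[[str, List[str]], str],
    reference_answer: str,
    num_samples: int = 8,
) -> List[float]:
    """For each prefix steps[:k+1], return the fraction of sampled
    completions whose final answer equals the reference answer."""
    scores = []
    for k in range(len(steps)):
        prefix = steps[: k + 1]
        hits = sum(
            rollout(problem, prefix) == reference_answer
            for _ in range(num_samples)
        )
        scores.append(hits / num_samples)
    return scores


def first_error_step(scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Return the index of the first step whose estimate is at or below
    the threshold, or None if every step stays above it."""
    for index, score in enumerate(scores):
        if score <= threshold:
            return index
    return None


def toy_rollout(problem: str, prefix: List[str]) -> str:
    """Hypothetical stand-in for an LLM generator completing a solution."""
    if any("2 + 3 = 6" in step for step in prefix):
        return "12"  # assumption: a wrong step derails every continuation
    # Otherwise the toy generator usually, but not always, reaches "10".
    return "10" if random.random() < 0.9 else "12"
```

Under this reading, a solution whose estimates never drop to the threshold would be labeled fully correct, and the index returned by `first_error_step` would serve as the automatic annotation for the reinforcement learning stage. The threshold and sample count are illustrative choices only.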
