

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

March 9, 2025
作者: Jiaxin Ai, Pengfei Zhou, Zhaopan Xu, Ming Li, Fanrui Zhang, Zizhen Li, Jianwen Sun, Yukang Feng, Baojin Huang, Zhongyuan Wang, Kaipeng Zhang
cs.AI

Abstract

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating the abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify, and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All resources will be released to foster future research on reliable multi-modal process evaluation.
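To make the two core ideas concrete, the sketch below illustrates (a) a step-level test case with expert labels for correctness, error type, and explanation, and (b) a Dynamic Dual-Phase style prompt in which the judge first solves the problem itself before assessing each candidate step. This is a minimal illustration, not the paper's released format: the dataclass field names, the example error taxonomy, and the prompt wording are all assumptions.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StepLabel:
    """One expert-annotated solution step (hypothetical field names)."""
    step_index: int
    is_correct: bool
    error_type: Optional[str]  # e.g. "reasoning error"; the paper's taxonomy may differ
    explanation: str           # expert rationale for the label

@dataclass
class TestCase:
    """One benchmark test case: a problem plus a step-wise model solution."""
    problem: str
    discipline: str            # one of the four scientific disciplines
    images: List[str] = field(default_factory=list)  # paths to multi-modal content
    steps: List[StepLabel] = field(default_factory=list)

def dual_phase_prompt(case: TestCase, solution_steps: List[str]) -> str:
    """Sketch of a Dynamic Dual-Phase style judge prompt: the model is asked
    to solve the problem first, then judge each step of the given solution
    against its own reasoning."""
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(solution_steps))
    return (
        f"Problem:\n{case.problem}\n\n"
        "Phase 1: Solve the problem yourself, reasoning step by step.\n"
        "Phase 2: Using your solution as a reference, judge each step below.\n"
        "For every step, output: correct/incorrect, an error type if incorrect, "
        "and a brief explanation.\n\n"
        f"Candidate solution:\n{numbered}\n"
    )

Structuring the prompt this way forces the judge to commit to its own solution before critiquing the candidate's, which is the behavior the fine-tuning strategy is described as encouraging.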
