

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

March 9, 2025
作者: Jiaxin Ai, Pengfei Zhou, Zhaopan Xu, Ming Li, Fanrui Zhang, Zizhen Li, Jianwen Sun, Yukang Feng, Baojin Huang, Zhongyuan Wang, Kaipeng Zhang
cs.AI

Abstract

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating the abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify, and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All resources will be released to foster future research on reliable multi-modal process evaluation.
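
The abstract describes each test case as a solution broken into steps, with every step labeled for correctness, error type, and explanation. As a rough illustration of that annotation structure, here is a minimal Python sketch of such a record; the field names, error taxonomy, and example contents below are assumptions made for illustration, not ProJudgeBench's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical error taxonomy: the abstract does not list the paper's actual
# error categories, so these labels are illustrative only.
ERROR_TYPES = [
    "reasoning_error",
    "calculation_error",
    "visual_misinterpretation",
    "knowledge_error",
]

@dataclass
class StepLabel:
    """One step-level annotation: correctness, error type, and explanation."""
    step_index: int
    is_correct: bool
    error_type: str | None = None   # one of ERROR_TYPES when is_correct is False
    explanation: str = ""           # expert rationale for the label

@dataclass
class TestCase:
    """One benchmark case: a problem, a model solution split into steps,
    and an expert label for every step."""
    discipline: str                 # one of the four scientific disciplines
    problem: str                    # may reference multi-modal content (e.g. a figure)
    solution_steps: list[str]
    step_labels: list[StepLabel] = field(default_factory=list)

# Example record (contents invented for illustration):
case = TestCase(
    discipline="physics",
    problem="Given the circuit in the figure, find the current through R2.",
    solution_steps=[
        "Step 1: Apply Kirchhoff's voltage law around the left loop.",
        "Step 2: Solve the resulting equation, giving I2 = 0.5 A.",
    ],
    step_labels=[
        StepLabel(0, True),
        StepLabel(1, False, "calculation_error",
                  "Sign error when summing voltage drops; I2 should be 0.25 A."),
    ],
)

# A process judge is scored on whether it detects the faulty step, assigns
# the right error type, and produces a diagnosis matching the explanation.
for label in case.step_labels:
    verdict = "correct" if label.is_correct else f"error ({label.error_type})"
    print(f"step {label.step_index + 1}: {verdict}")
```

Under this reading, the three scored abilities (detection, classification, diagnosis) correspond to predicting `is_correct`, `error_type`, and `explanation` for each step.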

