

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

March 16, 2025
Authors: Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang
cs.AI

Abstract

Reasoning is an essential capability for large language models (LLMs) to address complex tasks, and identifying process errors is vital for improving this ability. Recently, process-level reward models (PRMs) have been proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training, and that guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks for PRMs are text-based and focus on error detection, neglecting other scenarios such as reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answer Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench enables comprehensive evaluation and provides insights for the development of multimodal PRMs.
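To make the Answer Aggregation paradigm concrete, the sketch below shows one common way a PRM can rank candidate solutions: score every intermediate step, aggregate the step scores per solution (here via the minimum, a frequent choice since a chain is only as strong as its weakest step), and pick the highest-scoring candidate. This is a hypothetical illustration, not code from the MPBench paper; the `prm` scoring function, `aggregate_score`, and the toy data are all assumptions introduced for clarity.

```python
# Hypothetical sketch of PRM-based answer aggregation (best-of-N selection).
# Names and the toy PRM below are illustrative, not from the MPBench codebase.
from typing import Callable, List


def aggregate_score(step_scores: List[float]) -> float:
    # Aggregate per-step scores into one solution-level score.
    # Using min(): a reasoning chain is only as reliable as its weakest step.
    return min(step_scores)


def select_best_solution(
    solutions: List[List[str]],       # each candidate is a list of reasoning steps
    prm: Callable[[str], float],      # PRM: step text -> correctness score in [0, 1]
) -> int:
    # Score every step of every candidate, aggregate, and return the argmax index.
    scores = [aggregate_score([prm(step) for step in sol]) for sol in solutions]
    return max(range(len(solutions)), key=lambda i: scores[i])


# Toy stand-in for a learned PRM: rewards longer steps (illustration only).
toy_prm = lambda step: min(1.0, len(step) / 20)

candidates = [
    ["short", "steps"],
    ["a much more detailed step", "another well-justified step"],
]
print(select_best_solution(candidates, toy_prm))  # → 1
```

In practice the same scaffolding supports the other two paradigms: Step Correctness corresponds to thresholding the per-step `prm` scores directly, and Reasoning Process Search uses the scores to pick the next step during generation rather than ranking finished solutions.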

