MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

May 19, 2025
Authors: Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at https://github.com/ModalMinds/MM-PRM.
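
To make the two mechanisms named in the abstract concrete, here is a minimal Python sketch of (a) a soft step label and (b) Best-of-N selection with step-level scores. It is an illustration under stated assumptions, not the authors' implementation: the function names `soft_label` and `best_of_n`, the `score_steps` callback, and the default `min` aggregation are all hypothetical, and the soft label is assumed to be the common Monte Carlo estimate (the fraction of rollouts continued from a step that reach the verified answer). The abstract does not specify how step scores are combined into a path score, so the aggregator is left as a parameter.

```python
import math
from typing import Callable, List, Sequence


def soft_label(rollout_correct: Sequence[bool]) -> float:
    """Assumed soft label: fraction of rollouts from a step that end in the correct answer."""
    if not rollout_correct:
        raise ValueError("need at least one rollout outcome")
    return sum(rollout_correct) / len(rollout_correct)


def best_of_n(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], List[float]],
    aggregate: Callable[[List[float]], float] = min,
) -> int:
    """Return the index of the candidate path with the highest aggregated step score.

    `score_steps` stands in for a trained PRM: it maps a list of reasoning
    steps to one score per step. `aggregate` collapses those step scores into
    a single path score (min by default, which is an assumption of this sketch).
    """
    if not candidates:
        raise ValueError("need at least one candidate path")
    path_scores = [aggregate(score_steps(path)) for path in candidates]
    return max(range(len(candidates)), key=path_scores.__getitem__)


# Toy stand-in for a PRM: a fixed lookup table of per-step scores (hypothetical values).
TOY_SCORES = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.6, "e": 0.5}


def toy_score_steps(path: List[str]) -> List[float]:
    return [TOY_SCORES[step] for step in path]


paths = [["a", "b"], ["c", "d"], ["e"]]
chosen_min = best_of_n(paths, toy_score_steps)              # weakest step decides
chosen_prod = best_of_n(paths, toy_score_steps, math.prod)  # product of step scores
```

With the toy scores above, `min` aggregation penalizes the first path for its single weak step, while the product favors the shortest path; the choice of aggregator is therefore a design decision worth checking against the released code.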
