MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

May 19, 2025
Authors: Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM, a process reward model trained within a fully automated, scalable framework. We first build MM-Policy, a strong multimodal model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting PRM is used to score candidate reasoning paths in the Best-of-N inference setup and achieves significant improvements across both in-domain (MM-K12 test set) and out-of-domain (OlympiadBench, MathVista, etc.) benchmarks. Further analysis confirms the effectiveness of soft labels, smaller learning rates, and path diversity in optimizing PRM performance. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at https://github.com/ModalMinds/MM-PRM.
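
The abstract describes two mechanisms that a short sketch can make concrete: step-level soft labels estimated from Monte Carlo rollouts of a partial solution, and Best-of-N reranking of candidate reasoning paths with the trained PRM. The Python below is a minimal, hypothetical illustration of those ideas only; names such as `policy_rollout` and `prm_step_score` are assumptions rather than the released MM-PRM interface, and the minimum (weakest-link) aggregation of step scores is one common convention, not a detail confirmed by the abstract.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical sketch of the two ideas in the abstract:
# (1) Monte Carlo estimation of step-level soft labels from rollouts, and
# (2) Best-of-N reranking of candidate solutions with a process reward model.
# `policy_rollout` and `prm_step_score` are placeholder callables, not the authors' API.


def mc_soft_label(
    prefix_steps: Sequence[str],
    policy_rollout: Callable[[Sequence[str]], bool],
    num_rollouts: int = 16,
) -> float:
    """Estimate the quality of a partial solution as the fraction of policy
    rollouts continued from `prefix_steps` that reach a verifiably correct
    final answer; this fraction serves as a soft training label for the PRM."""
    successes = sum(policy_rollout(prefix_steps) for _ in range(num_rollouts))
    return successes / num_rollouts


@dataclass
class Candidate:
    steps: List[str]      # the step-by-step reasoning chain
    final_answer: str


def best_of_n(
    candidates: Sequence[Candidate],
    prm_step_score: Callable[[Sequence[str], int], float],
) -> Candidate:
    """Select the candidate whose reasoning path the PRM rates highest.
    Step scores are aggregated with the minimum (weakest-link) rule here,
    one common choice; the paper may aggregate differently."""
    def path_score(cand: Candidate) -> float:
        step_scores = [
            prm_step_score(cand.steps, i) for i in range(len(cand.steps))
        ]
        return min(step_scores) if step_scores else -math.inf

    return max(candidates, key=path_score)
```

In this framing, an MCTS-style annotation pipeline would repeatedly call something like `mc_soft_label` on partial solutions to produce step-level labels at scale, while `best_of_n` loosely corresponds to the inference-time setup evaluated on the MM-K12 test set and the out-of-domain benchmarks.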
