Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned
September 27, 2025
Authors: Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, Soujanya Poria
cs.AI
Abstract
Process Reward Models (PRMs) provide step-level supervision that improves the
reliability of reasoning in large language models. While PRMs have been
extensively studied in text-based domains, their extension to Vision Language
Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on
Monte Carlo Tree Search (MCTS) for data construction, which can often produce
noisy supervision signals and limit generalization across tasks. In this work,
we aim to elucidate the design space of VL-PRMs by exploring diverse strategies
for dataset construction, training, and test-time scaling. First, we introduce
a hybrid data synthesis framework that combines MCTS with judgments from a
strong VLM, producing more accurate step-level labels. Second, we propose
perception-focused supervision, enabling our PRM to explicitly detect errors at
the visual grounding stage of reasoning. Third, we systematically evaluate
multiple test-time scaling strategies, showing that our PRMs can reliably guide
VLMs toward more accurate solutions. Our experiments, covering five diverse
multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and
MathVision), reveal several key insights: (i) VL-PRMs, when used as Outcome
Reward Models (ORMs) during test-time scaling (TTS), can outperform
VL-PRM-guided process step selection, (ii) smaller VL-PRMs can match or even
surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent
reasoning abilities in stronger VLM backbones, (iv) perception-level
supervision yields significant gains in test-time scaling, and (v) TTS
performance of different policies improves on advanced math reasoning datasets
even though the VL-PRMs were not trained on such datasets. We hope our work
will motivate further research and support the advancement of VLMs.
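
To make the hybrid data synthesis idea concrete, here is a minimal sketch of how Monte Carlo rollout estimates could be combined with a strong VLM judge to produce step-level labels. This is not the authors' implementation; `sample_completions`, `is_correct`, and `judge_step_with_vlm` are hypothetical stand-ins for a policy-VLM sampler, an answer checker, and a judge-VLM call, and the thresholds are assumptions for illustration.

```python
# Illustrative sketch (not the paper's code) of hybrid step-level label
# synthesis: keep confident Monte Carlo estimates, defer ambiguous ones
# to a strong VLM judge to reduce noisy supervision.
from typing import Callable, List


def mc_step_value(prefix_steps: List[str], question: str, image,
                  sample_completions: Callable, is_correct: Callable,
                  n_rollouts: int = 8) -> float:
    """Fraction of rollouts from this step prefix that reach a correct answer."""
    completions = sample_completions(question, image, prefix_steps, n=n_rollouts)
    return sum(is_correct(c) for c in completions) / n_rollouts


def hybrid_step_label(prefix_steps: List[str], question: str, image,
                      sample_completions: Callable, is_correct: Callable,
                      judge_step_with_vlm: Callable,
                      low: float = 0.2, high: float = 0.8) -> int:
    """Return 1 (step judged correct) or 0 (step judged erroneous)."""
    value = mc_step_value(prefix_steps, question, image,
                          sample_completions, is_correct)
    if value >= high:
        return 1
    if value <= low:
        return 0
    # Ambiguous Monte Carlo signal: ask the judge VLM to verify the last step.
    return int(judge_step_with_vlm(question, image, prefix_steps))
```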
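Finding (i) concerns using a VL-PRM as an outcome reward model during test-time scaling. The sketch below shows one way best-of-N selection with such a scorer could look; it is an assumption-laden illustration, with `generate_solutions` and `prm_step_scores` as hypothetical interfaces rather than the paper's actual API, and the final-step score standing in for an ORM-style solution score.

```python
# Illustrative sketch (not the paper's code) of test-time scaling with a
# VL-PRM used as an outcome reward model: sample N candidate solutions,
# score each full solution, and keep the highest-scoring one.
from typing import Callable, List


def best_of_n(question: str, image, n: int,
              generate_solutions: Callable[..., List[List[str]]],
              prm_step_scores: Callable[..., List[float]]) -> List[str]:
    candidates = generate_solutions(question, image, n=n)

    def orm_score(steps: List[str]) -> float:
        # ORM-style usage: judge the solution as a whole by the score the
        # PRM assigns after reading all steps (here, the final step's score).
        scores = prm_step_scores(question, image, steps)
        return scores[-1]

    # Return the candidate solution with the highest outcome-style score.
    return max(candidates, key=orm_score)
```

The design choice here mirrors the abstract's observation: rather than selecting among intermediate steps during generation, the PRM scores complete candidate solutions and the best one is kept.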