

Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned

September 27, 2025
Authors: Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, Soujanya Poria
cs.AI

Abstract

Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which often produces noisy supervision signals and limits generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments, covering five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision), reveal several key insights: (i) VL-PRMs, when used as Outcome Reward Models (ORMs) during test-time scaling (TTS), can outperform VL-PRM-guided process step selection, (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, (iv) perception-level supervision leads to significant gains in test-time scaling, and (v) TTS performance of different policies improves on advanced math reasoning datasets even though the VL-PRMs were not trained on such datasets. We hope our work will motivate further research and support the advancement of VLMs.
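To make finding (i) concrete, the sketch below contrasts the two test-time scaling modes the abstract compares: using a VL-PRM as an Outcome Reward Model (score whole candidate solutions and keep the best) versus VL-PRM-guided step selection (greedily choose the next reasoning step with the highest step reward). This is a minimal illustration, not the authors' implementation; `policy_vlm`, `vl_prm`, `propose_steps`, and `step_prm` are hypothetical stand-ins for a policy VLM sampler and a trained VL-PRM scorer.

```python
from typing import Callable, List


def best_of_n_orm(
    question: str,
    image: bytes,
    policy_vlm: Callable[[str, bytes], str],     # samples one complete solution (assumed interface)
    vl_prm: Callable[[str, bytes, str], float],  # scores a complete solution (assumed interface)
    n: int = 8,
) -> str:
    """ORM-style TTS: sample N full solutions, return the one the VL-PRM scores highest."""
    candidates = [policy_vlm(question, image) for _ in range(n)]
    return max(candidates, key=lambda sol: vl_prm(question, image, sol))


def stepwise_prm_search(
    question: str,
    image: bytes,
    propose_steps: Callable[[str, bytes, List[str]], List[str]],  # proposes candidate next steps
    step_prm: Callable[[str, bytes, List[str], str], float],      # scores one candidate next step
    max_steps: int = 10,
) -> List[str]:
    """PRM-guided TTS: greedily keep the candidate next step with the highest step reward."""
    trace: List[str] = []
    for _ in range(max_steps):
        candidates = propose_steps(question, image, trace)
        if not candidates:
            break
        best = max(candidates, key=lambda step: step_prm(question, image, trace, step))
        trace.append(best)
        if best.strip().lower().startswith("final answer"):
            break
    return trace
```

Under this framing, the paper's result is that the simpler ORM-style selection over full solutions can outperform the step-by-step guided search, even though the same VL-PRM supplies the scores in both modes.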