Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned

Abstract

Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision Language Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on Monte Carlo Tree Search (MCTS) for data construction, which can produce noisy supervision signals and limit generalization across tasks. In this work, we aim to elucidate the design space of VL-PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong VLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments on five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) reveal several key insights: (i) VL-PRMs, when used as Outcome Reward Models (ORMs) during test-time scaling (TTS), can outperform VL-PRM-guided process step selection; (ii) smaller VL-PRMs can match or even surpass larger ones in detecting process errors; (iii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones; (iv) perception-level supervision leads to significant gains in test-time scaling; and (v) the TTS performance of different policies improves on advanced math reasoning datasets even though the VL-PRMs were not trained on such datasets. We hope our work will motivate further research and support the advancement of VLMs.
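To make the distinction in insight (i) concrete, the following is a minimal sketch of the two ways a step-level scorer might be used at test time: reranking complete solutions (ORM-style best-of-N) versus choosing one step at a time. It is an illustration under stated assumptions, not the procedure used in this work. The names `solution_score`, `best_of_n`, `step_guided`, and `toy_scorer` are hypothetical, and aggregating step scores with `min` is an assumed choice.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a solution is a list of reasoning-step strings, and a
# step scorer rates the LAST step of a prefix (higher means more likely correct).
Solution = List[str]
StepScorer = Callable[[Sequence[str]], float]
Proposer = Callable[[Sequence[str]], List[str]]


def solution_score(steps: Solution, score_step: StepScorer) -> float:
    """Collapse per-step scores into one solution-level score.

    Assumption: the weakest step decides (min). Assumes a non-empty solution.
    """
    return min(score_step(steps[: i + 1]) for i in range(len(steps)))


def best_of_n(candidates: List[Solution], score_step: StepScorer) -> Solution:
    """ORM-style use: rerank N complete solutions and keep the best one."""
    return max(candidates, key=lambda s: solution_score(s, score_step))


def step_guided(propose: Proposer, score_step: StepScorer,
                max_steps: int = 8) -> Solution:
    """Process-guided use: greedily keep the best-scoring next step."""
    prefix: Solution = []
    for _ in range(max_steps):
        options = propose(prefix)
        if not options:  # the proposer signals that the solution is finished
            break
        prefix.append(max(options, key=lambda step: score_step(prefix + [step])))
    return prefix


def toy_scorer(prefix: Sequence[str]) -> float:
    """Stand-in for a trained VL-PRM: penalize steps that contain 'wrong'."""
    return 0.1 if "wrong" in prefix[-1] else 0.9
```

In practice `score_step` would wrap a trained VL-PRM conditioned on the image and question, and `propose` would sample candidate next steps from the policy VLM. Both are replaced by toy stand-ins here.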
