

Training Data Efficiency in Multimodal Process Reward Models

February 4, 2026
Authors: Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, Jiaxin Huang
cs.AI

Abstract

Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency in MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and show that informative gradient updates depend on two factors: the label mixture of positive/negative steps and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass full-data performance at small data fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
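The abstract names the two ingredients of BIS (step-label mixture within a rollout and the reliability of its positive steps) but does not give the formula here. The Python sketch below is only an illustrative guess at how such a rollout-level score might be computed from existing MC annotations and used to pick a small training subset; the thresholding of MC scores, the product combination, and all function names are assumptions, not the authors' definition.

```python
import numpy as np

def balanced_information_score(mc_scores, threshold=0.0):
    """Hypothetical rollout-level score combining the two factors named in the abstract.

    mc_scores: per-step Monte Carlo scores in [0, 1] for one rollout.
    Steps with MC score > threshold are treated as positive (an assumption,
    not the paper's definition).
    """
    scores = np.asarray(mc_scores, dtype=float)
    positive = scores > threshold
    p = positive.mean()                    # fraction of positive steps
    mixture = 4.0 * p * (1.0 - p)          # 1.0 when labels are evenly mixed, 0.0 when all one class
    reliability = scores[positive].mean() if positive.any() else 0.0
    return mixture * reliability           # simple product; the paper's combination may differ

def select_subset(rollouts, fraction=0.10):
    """Keep the top-scoring fraction of rollouts (e.g. the 10% budget reported above)."""
    ranked = sorted(rollouts, key=balanced_information_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]

# Example: three rollouts with per-step MC scores.
rollouts = [
    [0.9, 0.8, 0.0, 0.1],   # mixed labels, reliable positives -> high score
    [0.9, 0.9, 0.8, 0.9],   # all positive -> zero mixture -> low score
    [0.4, 0.0, 0.3, 0.0],   # mixed but weak positives -> moderate score
]
print(select_subset(rollouts, fraction=0.34))
```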