

Training Data Efficiency in Multimodal Process Reward Models

February 4, 2026
Authors: Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, Jiaxin Huang
cs.AI

Abstract

Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies data efficiency in MPRM training. Our preliminary experiments reveal that MPRM performance quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and show that informative gradient updates depend on two factors: the label mixture of positive and negative steps, and label reliability (the average MC score of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability using existing MC signals at the rollout level, without incurring any additional annotation cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match or even surpass full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
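The abstract describes BIS as a rollout-level score combining label mixture and label reliability derived from existing MC annotations. The paper's exact formula is not given here, so the sketch below is a hypothetical instantiation: it assumes each rollout is a list of per-step MC scores in [0, 1], treats a step as positive when its score exceeds a threshold, uses binary entropy of the positive fraction as the mixture term, and the mean MC score of positive steps as the reliability term. The names `bis` and `select_top_fraction` are illustrative, not from the paper.

```python
# Hypothetical sketch of a Balanced-Information Score (BIS) for selecting
# informative rollouts from an MC-annotated corpus. Assumptions (not from
# the paper): per-step MC scores in [0, 1]; a step is "positive" if its
# score exceeds `threshold`; BIS = mixture * reliability.
from math import log

def bis(mc_scores, threshold=0.5):
    """Score one rollout; higher suggests a more informative training example."""
    if not mc_scores:
        return 0.0
    pos = [s for s in mc_scores if s > threshold]
    p = len(pos) / len(mc_scores)  # fraction of positive steps
    # Mixture term: binary entropy of the positive fraction, maximal (1.0)
    # when positive and negative steps are evenly mixed.
    if p in (0.0, 1.0):
        mixture = 0.0
    else:
        mixture = -(p * log(p, 2) + (1 - p) * log(1 - p, 2))
    # Reliability term: average MC score of the positive steps.
    reliability = sum(pos) / len(pos) if pos else 0.0
    return mixture * reliability

def select_top_fraction(rollouts, fraction=0.10):
    """Keep the top `fraction` of rollouts ranked by BIS (e.g. the 10% subset)."""
    ranked = sorted(rollouts, key=bis, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

Under these assumptions, a rollout whose steps are all labeled positive scores zero (no mixture, hence little gradient signal), while a rollout with an even split of confident positives and clear negatives scores highest, matching the two factors the abstract identifies.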