Improving Vision-Language Models with Perception-Centric Process Reward Models

Abstract

Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from a response, compares them one by one with the visual evidence in the image, and returns the claims that contain perceptual errors. Perceval is trained on perception-intensive supervised data and then integrated into the RL training process to train policy models. Specifically, whereas traditional GRPO applies sequence-level advantages, we apply token-level advantages by targeting penalties at the hallucinated spans identified by Perceval, which provides a fine-grained supervision signal. Beyond training, Perceval can also assist VLMs at inference time: we truncate the erroneous portion of the model's response and then either have the model regenerate the response directly or prompt it to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, our approach also yields consistent performance gains over other strategies such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
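
To make the two mechanisms in the abstract concrete, the following is a minimal Python sketch, not the released implementation. It assumes group-normalized sequence-level advantages of the usual GRPO form, a fixed scalar penalty subtracted on flagged tokens, and flagged spans given as (start, end) token indices with an exclusive end. The function names, the `penalty` parameter, the span format, and the `generate` and `find_first_error` callables are all hypothetical placeholders; the abstract does not specify the exact penalty rule or the interface to Perceval.

```python
from statistics import mean, pstdev


def grpo_sequence_advantages(rewards, eps=1e-6):
    """Sequence-level advantages: normalize each reward within its group.

    Assumes the common GRPO form (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_advantages(seq_adv, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast one sequence-level advantage to every token, then lower it
    on tokens inside spans flagged as perceptual errors.

    flagged_spans: list of (start, end) token indices, end exclusive
    (hypothetical format). Overlapping spans are penalized only once.
    """
    adv = [seq_adv] * num_tokens
    for start, end in flagged_spans:
        for t in range(max(0, start), min(end, num_tokens)):
            adv[t] = seq_adv - penalty
    return adv


def truncate_and_regenerate(generate, find_first_error, prompt, max_rounds=3):
    """Test-time loop: cut the response at the first flagged error and
    let the model continue from the remaining prefix.

    generate(prompt, prefix) -> full response that starts with prefix.
    find_first_error(response) -> character index of the first flagged
    span, or None if nothing is flagged. Both are hypothetical callables.
    """
    response = generate(prompt, "")
    for _ in range(max_rounds):
        cut = find_first_error(response)
        if cut is None:
            break
        response = generate(prompt, response[:cut])
    return response


# Toy usage: four sampled responses with binary outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
seq_adv = grpo_sequence_advantages(rewards)
# The first response has 6 tokens; tokens 2 and 3 are flagged as hallucinated.
tok_adv = token_level_advantages(seq_adv[0], 6, [(2, 4)], penalty=0.5)
```

In this sketch a correct response still receives a lower advantage on its hallucinated tokens than on the rest, which is the sense in which the signal is finer than a single sequence-level value. The reflection variant described in the abstract would replace the plain continuation in `truncate_and_regenerate` with a prompt that asks the model to revisit its earlier output.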