LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
August 31, 2025
Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
cs.AI
Abstract
In vision-language modeling, critic models are typically trained to evaluate
outputs -- assigning scalar scores or pairwise preferences -- rather than to
generate responses. This separation from policy models, which produce the
responses, is so entrenched that critics are rarely considered for direct
policy use. In this work, we challenge this convention. We propose to
reorganize preference-labeled critic datasets into verifiable training signals
and perform reinforcement learning directly on a base generative model,
producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference
judgments while retaining full generation ability. Surprisingly,
LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a
competitive policy model -- matching or surpassing specialized reasoning VLMs
trained with in-domain data across 26 visual reasoning and understanding
benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B).
Extending this approach to existing strong reasoning VLMs yields
LLaVA-Critic-R1+, which further advances policy performance without sacrificing
critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale.
Finally, we show that the enhanced critic ability benefits inference: applying
self-critique at test time yields an average +13.8% improvement on five
representative reasoning tasks without additional training. Our results reveal
that RL training on critic data can produce a unified model excelling at both
evaluation and generation, offering a simple path toward scalable,
self-improving multimodal systems.
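The abstract points at two concrete mechanisms: (1) preference labels in critic data are treated as verifiable training signals, so a free-form judgment written by the model can be scored automatically during RL, and (2) at test time the model critiques its own candidate answers. The sketch below is a minimal illustration of both ideas under assumed data formats and prompt wording; it is not the released LLaVA-Critic-R1 code, and every name in it (`CriticExample`, `build_critic_prompt`, `verifiable_reward`, `self_critique_select`, `model.generate`) is a hypothetical stand-in.

```python
# Minimal sketch of (1) a verifiable reward built from preference-labeled critic
# data and (2) test-time self-critique. Data fields, prompt wording, and the
# model interface are illustrative assumptions, not the paper's actual setup.

import re
from dataclasses import dataclass


@dataclass
class CriticExample:
    """One preference-labeled comparison: a question plus two candidate
    responses, where `preferred` records which one the annotators chose."""
    question: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B": the ground-truth preference label


def build_critic_prompt(ex: CriticExample) -> str:
    """Format the comparison as a generation task: the model reasons freely,
    then must end with a verdict line such as 'Answer: A'."""
    return (
        f"Question: {ex.question}\n"
        f"Response A: {ex.response_a}\n"
        f"Response B: {ex.response_b}\n"
        "Compare the two responses and decide which is better. "
        "Finish with 'Answer: A' or 'Answer: B'."
    )


def verifiable_reward(model_output: str, ex: CriticExample) -> float:
    """Because the preference label is known, the model's free-form judgment
    can be checked by a rule, yielding a verifiable RL reward with no learned
    reward model."""
    match = re.search(r"Answer:\s*([AB])", model_output)
    if match is None:
        return 0.0  # unparsable verdict -> no reward
    return 1.0 if match.group(1) == ex.preferred else 0.0


def self_critique_select(model, question: str, num_candidates: int = 4) -> str:
    """Test-time self-critique (no extra training): sample several candidate
    answers, then let the same model judge pairs and keep the winner.
    `model.generate(prompt)` stands in for whatever decoding API is used."""
    candidates = [
        model.generate(f"Question: {question}\nAnswer:")
        for _ in range(num_candidates)
    ]
    best = candidates[0]
    for challenger in candidates[1:]:
        ex = CriticExample(question, best, challenger, preferred="?")
        verdict = model.generate(build_critic_prompt(ex))
        if re.search(r"Answer:\s*B", verdict):  # model prefers the challenger
            best = challenger
    return best
```

The point the sketch tries to make explicit is that the reward in `verifiable_reward` is rule-checkable against an existing preference label, so RL can run directly on the base generative model without a separate learned reward model, and the same model that writes the judgment can, at inference, select among its own answers.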