LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
August 31, 2025
Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
cs.AI
Abstract
In vision-language modeling, critic models are typically trained to evaluate
outputs -- assigning scalar scores or pairwise preferences -- rather than to
generate responses. This separation from policy models, which produce the
responses, is so entrenched that critics are rarely considered for direct
policy use. In this work, we challenge this convention. We propose to
reorganize preference-labeled critic datasets into verifiable training signals
and perform reinforcement learning directly on a base generative model,
producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference
judgments while retaining full generation ability. Surprisingly,
LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a
competitive policy model -- matching or surpassing specialized reasoning VLMs
trained with in-domain data across 26 visual reasoning and understanding
benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B).
Extending this approach to existing strong reasoning VLMs yields
LLaVA-Critic-R1+, which further advances policy performance without sacrificing
critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale.
Finally, we show that the enhanced critic ability benefits inference: applying
self-critique at test time yields an average +13.8% improvement on five
representative reasoning tasks without additional training. Our results reveal
that RL training on critic data can produce a unified model excelling at both
evaluation and generation, offering a simple path toward scalable,
self-improving multimodal systems.
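The abstract points at two concrete mechanisms: (1) preference labels in critic data are treated as verifiable training signals, so a free-form judgment written by the model can be scored automatically during RL, and (2) at test time the model critiques its own candidate answers. The sketch below is a minimal illustration of both ideas under assumed data formats and prompt wording; it is not the released LLaVA-Critic-R1 code, and every name in it (`CriticExample`, `build_critic_prompt`, `verifiable_reward`, `self_critique_select`, `model.generate`) is a hypothetical stand-in.

```python
# Minimal sketch of (1) a verifiable reward built from preference-labeled critic
# data and (2) test-time self-critique. Data fields, prompt wording, and the
# model interface are illustrative assumptions, not the paper's actual setup.

import re
from dataclasses import dataclass


@dataclass
class CriticExample:
    """One preference-labeled comparison: a question plus two candidate
    responses, where `preferred` records which one the annotators chose."""
    question: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B": the ground-truth preference label


def build_critic_prompt(ex: CriticExample) -> str:
    """Format the comparison as a generation task: the model reasons freely,
    then must end with a verdict line such as 'Answer: A'."""
    return (
        f"Question: {ex.question}\n"
        f"Response A: {ex.response_a}\n"
        f"Response B: {ex.response_b}\n"
        "Compare the two responses and decide which is better. "
        "Finish with 'Answer: A' or 'Answer: B'."
    )


def verifiable_reward(model_output: str, ex: CriticExample) -> float:
    """Because the preference label is known, the model's free-form judgment
    can be checked by a rule, yielding a verifiable RL reward with no learned
    reward model."""
    match = re.search(r"Answer:\s*([AB])", model_output)
    if match is None:
        return 0.0  # unparsable verdict -> no reward
    return 1.0 if match.group(1) == ex.preferred else 0.0


def self_critique_select(model, question: str, num_candidates: int = 4) -> str:
    """Test-time self-critique (no extra training): sample several candidate
    answers, then let the same model judge pairs and keep the winner.
    `model.generate(prompt)` stands in for whatever decoding API is used."""
    candidates = [
        model.generate(f"Question: {question}\nAnswer:")
        for _ in range(num_candidates)
    ]
    best = candidates[0]
    for challenger in candidates[1:]:
        ex = CriticExample(question, best, challenger, preferred="?")
        verdict = model.generate(build_critic_prompt(ex))
        if re.search(r"Answer:\s*B", verdict):  # model prefers the challenger
            best = challenger
    return best
```

The point the sketch tries to make explicit is that the reward in `verifiable_reward` is rule-checkable against an existing preference label, so RL can run directly on the base generative model without a separate learned reward model, and the same model that writes the judgment can, at inference, select among its own answers.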