LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

August 31, 2025
Authors: Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
cs.AI

Abstract

In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.