LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Abstract

In vision-language modeling, critic models are typically trained to evaluate outputs, by assigning scalar scores or judging pairwise preferences, rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct use as policies. In this work, we challenge this convention. We reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model: across 26 visual reasoning and understanding benchmarks it matches or surpasses specialized reasoning VLMs trained with in-domain data, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality and achieves a state-of-the-art score of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model that excels at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
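The two mechanisms named in the abstract, turning preference labels into a verifiable reward and applying self-critique at test time, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the "Preferred: A/B" output format, the function names, the binary reward, and the sequential pairwise selection loop are all hypothetical.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical output format (an assumption, not the paper's): the critic ends
# its reasoning with a line such as "Preferred: A" or "Preferred: B".
_VERDICT = re.compile(r"Preferred:\s*([AB])\b")


def parse_verdict(critic_output: str) -> Optional[str]:
    """Return the last 'A'/'B' verdict found in the critic's text, or None."""
    matches = _VERDICT.findall(critic_output)
    return matches[-1] if matches else None


def preference_reward(critic_output: str, gold_label: str) -> float:
    """Verifiable reward: 1.0 if the parsed verdict equals the human
    preference label, 0.0 otherwise (including when no verdict is found)."""
    return 1.0 if parse_verdict(critic_output) == gold_label else 0.0


def select_best(candidates: Sequence[str],
                judge: Callable[[str, str], str]) -> str:
    """Test-time self-critique as a sequential pairwise comparison.

    `judge(a, b)` is a hypothetical callable that returns "A" if the first
    response is preferred and "B" otherwise; in practice it would wrap a call
    to the same model acting as critic.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(best, challenger) == "B":
            best = challenger
    return best
```

Because the reward is computed by comparing a parsed verdict with a stored label, it can be checked automatically, which is what makes preference data usable as an RL signal in this reading. The selection loop needs only one judge call per additional candidate, at the cost of depending on comparison order.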
