LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

Abstract

In vision-language modeling, critic models are typically trained to evaluate outputs, by assigning scalar scores or pairwise preferences, rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model: across 26 visual reasoning and understanding benchmarks it matches or surpasses specialized reasoning VLMs trained with in-domain data, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a state-of-the-art score of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model that excels at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
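As a rough illustration of what "verifiable training signals" could mean for pairwise preference data, the sketch below scores a critic's free-form output against the dataset's preference label. The abstract does not specify the output format or the reward shape, so the boxed-choice convention, the function names (`extract_choice`, `preference_reward`), and the binary 0/1 reward are all assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import Optional

# Assumed convention (not from the abstract): the critic ends its reasoning
# with "\boxed{1}" or "\boxed{2}" to name the response it prefers.
_CHOICE = re.compile(r"\\boxed\{\s*([12])\s*\}")


def extract_choice(critic_output: str) -> Optional[int]:
    """Return the last boxed choice (1 or 2), or None if no choice is found."""
    matches = _CHOICE.findall(critic_output)
    return int(matches[-1]) if matches else None


def preference_reward(critic_output: str, preferred: int) -> float:
    """Hypothetical binary reward: 1.0 if the parsed choice equals the label."""
    return 1.0 if extract_choice(critic_output) == preferred else 0.0
```

Because the reward depends only on whether the final choice agrees with the label, it can be checked automatically, which is the property an RL loop on a generative model would need. An output with no parsable choice simply receives zero reward in this sketch.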
