VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Abstract

Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which uses tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks come with well-defined ground-truth annotations, which makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate extending R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework that uses RL to improve VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1.
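To make the idea of a rule-based reward on a visual task concrete, the following is a minimal sketch, not the reward actually used in VLM-R1. It assumes a grounding-style task where the model's text completion contains a box written as `[x1, y1, x2, y2]`, and it scores the completion by thresholding its intersection-over-union (IoU) with the annotated box. The function names, the box format, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]

_NUM = r"(-?\d+(?:\.\d+)?)"
# Matches the first "[x1, y1, x2, y2]" pattern in a completion (assumed output format).
_BOX_RE = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def parse_box(text: str) -> Optional[Box]:
    """Extract the first box from the model's text, or None if absent."""
    m = _BOX_RE.search(text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    return (x1, y1, x2, y2)


def rule_based_reward(completion: str, gt_box: Box, iou_threshold: float = 0.5) -> float:
    """Deterministic reward: 1.0 if the predicted box overlaps the
    ground-truth annotation enough, else 0.0. No learned reward model is involved."""
    pred = parse_box(completion)
    if pred is None:
        return 0.0  # unparseable output earns nothing
    return 1.0 if iou(pred, gt_box) >= iou_threshold else 0.0
```

Because the score depends only on the annotation and a fixed rule, the same completion always receives the same reward, which is the property the abstract highlights as enabling precise and stable reward computation.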

