Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Abstract

DeepSeek-R1-Zero has demonstrated that reasoning capabilities can emerge in large language models (LLMs) purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be used to enhance the reasoning capability of multimodal large language models (MLLMs). However, direct training with RL struggles to activate complex reasoning behaviors such as questioning and reflection in MLLMs, because substantial high-quality multimodal reasoning data is lacking. To address this issue, we propose Vision-R1, a reasoning MLLM designed to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal chain-of-thought (CoT) dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1. Through modality bridging and data filtering, this yields a 200K multimodal CoT dataset, the Vision-R1-cold dataset, which serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose the Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show that our model achieves an average improvement of approximately 6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released on GitHub: https://github.com/Osilly/Vision-R1
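
The abstract names two RL ingredients without spelling them out: a hard formatting result reward and GRPO. The sketch below shows one plausible reading of each; it is not the authors' implementation. It assumes a binary reward that is granted only when a completion both follows a fixed template and gives the correct final answer, and it uses the standard group-relative advantage (each reward normalized by the mean and standard deviation of its sampling group). The `<think>`/`<answer>` template, the exact-match answer check, the function names, and the `eps` constant are all assumptions made for illustration.

```python
import re
import statistics

# Hypothetical output template. The abstract does not specify the format,
# so the <think>/<answer> tags are an assumption for illustration only.
_TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def hard_format_result_reward(completion: str, reference: str) -> float:
    """Return 1.0 only if the format is valid AND the answer matches; else 0.0."""
    match = _TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0  # wrong format earns nothing, even if the answer is right
    answer = match.group(2).strip()
    # Exact string match is a simplification; a real checker would
    # normalize mathematical expressions before comparing.
    return 1.0 if answer == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled completions for one question whose reference answer is "4".
    group = [
        "<think>2 + 2 = 4</think><answer>4</answer>",
        "<think>2 + 2 = 5</think><answer>5</answer>",
        "The answer is 4",  # correct answer, but the template is missing
        "<think>two plus two</think> <answer> 4 </answer>",
    ]
    rewards = [hard_format_result_reward(c, "4") for c in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under this reading, a completion with the right answer but the wrong format is rewarded exactly like a wrong answer, which is what makes the format constraint "hard". When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.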
