Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Abstract

Inspired by the remarkable reasoning capabilities of DeepSeek-R1 on complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, these approaches still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we examine current training pipelines in depth and identify three crucial phenomena: 1) Effective cold-start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can yield performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating these insights and addressing the issues in multimodal RL, we introduce ReVisual-R1, which achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.
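The abstract does not say how gradient stagnation arises under GRPO. One well-known way a group-relative method can lose its learning signal is sketched below: GRPO scores each sampled response against the other responses to the same prompt, so when every response in a group earns the same reward, every advantage is zero and that prompt contributes no gradient. This is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the binary rewards, the group size, and the use of population standard deviation with a small `eps` are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps).

    Assumption: population standard deviation and an additive eps;
    real implementations may differ in both details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """True when at least one response in the group gets a non-zero advantage."""
    return any(abs(a) > tol for a in group_relative_advantages(rewards))


# Hypothetical binary correctness rewards: three prompts, four samples each.
groups = {
    "mixed": [1.0, 0.0, 0.0, 1.0],      # some right, some wrong
    "all_wrong": [0.0, 0.0, 0.0, 0.0],  # prompt too hard for the policy
    "all_right": [1.0, 1.0, 1.0, 1.0],  # prompt too easy for the policy
}

for name, rewards in groups.items():
    print(name, has_learning_signal(rewards))
```

Only the mixed group yields non-zero advantages. If most prompts in a batch fall into the uniform cases, the effective batch shrinks, which is one plausible reading of the stagnation the abstract describes.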
