Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Abstract

Inspired by the remarkable reasoning capabilities of DeepSeek-R1 on complex textual tasks, many works attempt to elicit similar capabilities in Multimodal Large Language Models (MLLMs) by applying reinforcement learning (RL) directly. However, these approaches still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we analyze current training pipelines end to end and identify three crucial phenomena: 1) Effective cold-start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can yield performance that surpasses many recent multimodal reasoning models, even before any multimodal RL is applied. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) A subsequent text-only RL phase, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances the development of perceptual grounding and cognitive reasoning. By incorporating these insights and addressing the issues in multimodal RL, we introduce ReVisual-R1, which achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME2024, and AIME2025.
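The abstract does not spell out the mechanism behind gradient stagnation in GRPO. One commonly cited cause, assumed here for illustration, is that GRPO scores each sampled response relative to the other responses in its group, so a group whose rewards are all identical yields zero advantages and contributes no policy gradient. The sketch below shows only that normalization step; the function name, the `eps` constant, and the example rewards are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group (GRPO-style sketch).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes produces non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward produces all zeros,
# so this group would contribute no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(uniform)
```

If many prompts are either too easy or too hard for the current policy, most groups fall into the uniform case, which is one plausible reading of the stagnation described above.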
