

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

June 4, 2025
作者: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
cs.AI

Abstract

Inspired by the remarkable reasoning capabilities of DeepSeek-R1 on complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, these approaches still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold-start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone yields performance surpassing many recent multimodal reasoning models, even before any multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances the development of perceptual grounding and cognitive reasoning. By incorporating the above insights and addressing the multimodal RL issues, we introduce ReVisual-R1, which achieves a new state of the art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, and DynaMath, as well as AIME2024 and AIME2025.
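The gradient stagnation the abstract attributes to standard GRPO can be illustrated with a minimal sketch of GRPO's group-relative advantage: each sampled response's reward is normalized by the mean and standard deviation of its rollout group. The function name and reward values below are illustrative, not taken from the paper; this is a sketch of the general mechanism, not of ReVisual-R1's implementation.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: center each sampled
    response's reward on the group mean and scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# A mixed group (some correct, some wrong rollouts) yields non-zero
# advantages, so the policy receives a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# If every rollout in the group earns the same reward (all correct or
# all wrong), every advantage is exactly zero, and the policy gradient
# for that prompt vanishes.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)  # uniform is [0.0, 0.0, 0.0, 0.0]
```

When many prompts in a batch produce such uniform-reward groups, most of the batch contributes no gradient, which is one plausible reading of the stagnation and instability described above.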