Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning
June 4, 2025
Authors: Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, Yu Cheng
cs.AI
Abstract
Inspired by the remarkable reasoning capabilities of DeepSeek-R1 in complex
textual tasks, many works attempt to incentivize similar capabilities in
Multimodal Large Language Models (MLLMs) by directly applying reinforcement
learning (RL). However, these approaches still struggle to activate complex reasoning. In
this paper, rather than examining multimodal RL in isolation, we delve into
current training pipelines and identify three crucial phenomena: 1) Effective
cold-start initialization is critical for enhancing MLLM reasoning.
Intriguingly, we find that initializing with carefully selected text data alone
can lead to performance surpassing many recent multimodal reasoning models,
even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers
from gradient stagnation, which degrades training stability and performance. 3)
A subsequent text-only RL phase, applied after multimodal RL, further
enhances multimodal reasoning. This staged training approach effectively
balances perceptual grounding and cognitive reasoning development. By
incorporating the above insights and addressing multimodal RL issues, we
introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B
MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath,
LogicVista, and DynaMath, as well as AIME2024 and AIME2025.
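
Phenomenon 2 can be made concrete through the group-normalized advantage that GRPO computes per prompt. Below is a minimal sketch, assuming binary verifier rewards and the standard GRPO normalization; the helper `grpo_advantages` and its `eps` smoothing term are illustrative assumptions, not taken from the paper's implementation. It shows that when every response sampled for a prompt receives the same reward, all advantages collapse to zero, so that prompt contributes no policy gradient.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage in the style of GRPO: normalize each
    response's reward by the mean and std of its sample group.
    (Illustrative helper, not from the ReVisual-R1 codebase.)"""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# A mixed group yields non-zero advantages, i.e., a usable learning signal.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))   # ~[ 1. -1.  1. -1.]

# An all-correct (or all-wrong) group yields all-zero advantages, so its
# policy-gradient term vanishes -- the "gradient stagnation" the abstract
# attributes to standard GRPO on multimodal data.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))   # [0. 0. 0. 0.]
```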