
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

May 28, 2025
作者: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
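The second stage of the approach uses GRPO, whose core idea is to replace a learned value-function baseline with a group-relative one: for each prompt, a group of responses is sampled and each response's reward is normalized against the group's mean and standard deviation. The sketch below illustrates only that advantage computation under stated assumptions; it is not the authors' implementation, and the binary-correctness reward shown is a hypothetical example.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO: each sampled
    response's reward is normalized by the mean and (population)
    standard deviation of its group, removing the need for a
    learned critic."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# For one prompt, sample a group of responses and score each,
# e.g. 1.0 if the final answer is correct, else 0.0 (an assumed
# verifiable reward, common in reasoning RL setups).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)
# Correct responses receive positive advantage, incorrect ones
# negative; the advantages sum to (approximately) zero.
```

These advantages then weight the policy-gradient update for each token of the corresponding response, typically together with a clipped importance ratio and a KL penalty toward the reference model.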

