

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

May 28, 2025
Authors: Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
cs.AI

Abstract

Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
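The two-stage recipe in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: `two_stage_train`, the `model` methods, and `reward_fn` are hypothetical placeholders; `grpo_advantages` shows the group-relative reward normalization that characterizes GRPO.

```python
def grpo_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    against the mean and standard deviation of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:
        # All rewards identical: no relative signal within the group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def two_stage_train(model, sft_data, rl_prompts, reward_fn, group_size=8):
    """Hypothetical driver for the cold-start-then-RL recipe."""
    # Stage 1: SFT cold start on structured chain-of-thought traces.
    model.finetune(sft_data)
    # Stage 2: GRPO refinement. For each prompt, sample a group of
    # responses, score them, and update with group-normalized advantages.
    for prompt in rl_prompts:
        responses = [model.sample(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, resp) for resp in responses]
        model.policy_update(prompt, responses, grpo_advantages(rewards))
    return model
```

Because advantages are computed relative to the group rather than a learned value baseline, GRPO needs no separate critic model, which is one reason it is a common choice for refining reasoning models after an SFT cold start.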

