Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns (where models exhibit self-correction through reflection) are often attributed to emergent properties of RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but do not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
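
As a rough illustration of the RL stage, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several responses are sampled for one prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The abstract does not describe the reward design or the exact normalization, so the function name, the binary rewards, the use of the population standard deviation, and the `eps` value are all assumptions made for illustration and are not taken from the authors' code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for a single prompt
# (1.0 = correct final answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to form a baseline. When all responses in a group earn the same reward, every advantage is zero and that prompt contributes no policy gradient signal.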
