LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Abstract

Enhancing reasoning in Large Multimodal Models (LMMs) is uniquely challenging because visual perception and logical reasoning interact in complex ways. The difficulty is most pronounced in compact 3B-parameter models, where architectural constraints limit both reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, extending it to multimodal settings confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework that adapts rule-based RL to multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL; the MGT stage then generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines on multimodal and text-only benchmarks, respectively, with a 3.63% gain on complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses the need for costly high-quality multimodal training data.
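To make the first barrier concrete, here is a minimal sketch of what a rule-based reward might look like. The abstract does not specify the reward design, so everything below is an assumption for illustration: the `<answer>` tag convention, the function names, and the exact-match rule are hypothetical and are not taken from the paper. The sketch only shows why such a reward needs a single checkable reference answer, which is hard to obtain when answers are ambiguous.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None.

    The <answer> tag convention is an assumption made for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def rule_based_reward(response: str, reference: str) -> float:
    """Hypothetical reward: 1.0 for a correct, well-formatted answer, else 0.0.

    No learned reward model is involved; the rule is a plain string comparison
    against a reference answer, so the reference must be unambiguous.
    """
    answer = extract_answer(response)
    if answer is None:
        return 0.0  # missing or malformed answer tags
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


# The two stages named in the abstract, paired with the data each one uses.
STAGES = [("FRE", "text-only"), ("MGT", "multimodal")]
```

Under these assumptions the same reward function could be reused in both stages, with only the training data changing between FRE and MGT.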
