LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
March 10, 2025
Authors: Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang
cs.AI
Abstract
Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges
from the complex interplay between visual perception and logical reasoning,
particularly in compact 3B-parameter architectures where architectural
constraints limit reasoning capacity and modality alignment.
While rule-based reinforcement learning (RL) excels in text-only domains, its
multimodal extension confronts two critical barriers: (1) data limitations due
to ambiguous answers and scarce complex reasoning examples, and (2) degraded
foundational reasoning induced by multimodal pretraining.
To address these challenges, we propose LMM-R1, a two-stage
framework adapting rule-based RL for multimodal reasoning through
Foundational Reasoning Enhancement (FRE) followed by
Multimodal Generalization Training (MGT). The FRE stage first
strengthens reasoning abilities using text-only data with rule-based RL, then
the MGT stage generalizes these reasoning capabilities to multimodal domains.
Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves
4.83% and 4.5% average improvements over baselines in multimodal and
text-only benchmarks, respectively, with a 3.63% gain in complex Football Game
tasks. These results validate that text-based reasoning enhancement enables
effective multimodal generalization, offering a data-efficient paradigm that
bypasses costly high-quality multimodal training data.
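
The abstract does not spell out the reward rules, so the following is only a minimal sketch of what a rule-based reward for this kind of training might look like. It assumes, purely for illustration, that the model wraps its final answer in `<answer>` tags and that correctness is checked by exact string match; the tag format, the function name `rule_based_reward`, and the `STAGES` list are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical answer format: the final answer is wrapped in <answer>...</answer>.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the tagged answer matches the reference exactly, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # No parsable answer: the rule cannot verify it, so no reward.
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0


# Assumed schedule mirroring the two stages named in the abstract:
# text-only data first (FRE), then multimodal data (MGT).
STAGES = [("FRE", "text-only"), ("MGT", "multimodal")]
```

A reward of this kind needs an unambiguous reference answer, which may be one reason the abstract lists ambiguous answers as a data limitation for the multimodal setting.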