MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
September 25, 2025
Authors: Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
cs.AI
Abstract
Large multimodal reasoning models have achieved rapid progress, but their
advancement is constrained by two major limitations: the absence of open,
large-scale, high-quality long chain-of-thought (CoT) data, and the instability
of reinforcement learning (RL) algorithms in post-training. Group Relative
Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone
to gradient vanishing when reward variance is low, which weakens optimization
signals and impairs convergence. This work makes three contributions: (1) We
propose Variance-Aware Sampling (VAS), a data selection strategy guided by
Variance Promotion Score (VPS) that combines outcome variance and trajectory
diversity to promote reward variance and stabilize policy optimization. (2) We
release large-scale, carefully curated resources containing ~1.6M long CoT
cold-start examples and ~15k RL QA pairs, designed to ensure quality,
difficulty, and diversity, along with a fully reproducible end-to-end training
codebase. (3) We open-source a family of multimodal reasoning models at
multiple scales,
establishing standardized baselines for the community. Experiments across
mathematical reasoning benchmarks demonstrate the effectiveness of both the
curated data and the proposed VAS. Comprehensive ablation studies and analyses
provide further insight into the contributions of each component. In addition,
we theoretically establish that reward variance lower-bounds the expected
policy gradient magnitude, with VAS serving as a practical mechanism to realize
this guarantee. Our code, data, and checkpoints are available at
https://github.com/LengSicong/MMR1.
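
As context for the gradient-vanishing claim above: GRPO normalizes each rollout's reward by the statistics of its group, so if every rollout for a prompt receives the same reward, all group-relative advantages are zero and that prompt contributes no gradient. The display below uses the commonly cited form of the GRPO advantage and gradient estimate; the notation (group size G, rewards r_i, responses o_i to query q, stabilizer epsilon) is illustrative and may differ slightly from the paper's exact statement.

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon},
\qquad
\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{G}\sum_{i=1}^{G} \hat{A}_i\, \nabla_\theta \log \pi_\theta(o_i \mid q),
\]

so when the group rewards have (near-)zero variance, every \(\hat{A}_i \approx 0\) and the update signal for that prompt vanishes. This is the degenerate regime that Variance-Aware Sampling is designed to avoid by preferring prompts with high expected reward variance.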
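The abstract describes VAS only at a high level. The sketch below is a hypothetical illustration of variance-aware sampling, not the released implementation: it scores each prompt with a VPS-like mixture of outcome-reward variance and a simple trajectory-diversity proxy (mean pairwise Jaccard distance over response token sets), then samples prompts in proportion to that score. The function names, the Jaccard proxy, and the mixing weight lam are assumptions made for illustration.

import numpy as np

def outcome_variance(rewards):
    # Variance of per-rollout rewards (e.g. 0/1 correctness) for one prompt.
    # For binary rewards this equals p * (1 - p), maximized when accuracy is ~0.5.
    return float(np.var(rewards))

def trajectory_diversity(token_sets):
    # Toy diversity proxy: mean pairwise Jaccard distance between rollout token sets.
    # The paper's actual diversity measure may differ; this is only illustrative.
    n = len(token_sets)
    if n < 2:
        return 0.0
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = token_sets[i], token_sets[j]
            union = len(a | b)
            dists.append(1.0 - (len(a & b) / union if union else 1.0))
    return float(np.mean(dists))

def variance_promotion_score(rewards, token_sets, lam=0.5):
    # Hypothetical VPS: convex combination of outcome variance and trajectory diversity.
    return lam * outcome_variance(rewards) + (1.0 - lam) * trajectory_diversity(token_sets)

def variance_aware_sample(prompts, vps_scores, k, seed=0):
    # Draw k prompts with probability proportional to VPS, so prompts whose rollouts
    # all succeed or all fail (near-zero GRPO advantage) are sampled less often.
    rng = np.random.default_rng(seed)
    p = np.asarray(vps_scores, dtype=float)
    p = p / p.sum() if p.sum() > 0 else np.full(len(prompts), 1.0 / len(prompts))
    idx = rng.choice(len(prompts), size=k, replace=False, p=p)
    return [prompts[i] for i in idx]

In such a loop, each candidate prompt would be rolled out G times with the current policy, its rewards and response token sets scored with variance_promotion_score, and the resulting scores passed to variance_aware_sample when building each training batch.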