MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
September 25, 2025
Authors: Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
cs.AI
Abstract
Large multimodal reasoning models have achieved rapid progress, but their
advancement is constrained by two major limitations: the absence of open,
large-scale, high-quality long chain-of-thought (CoT) data, and the instability
of reinforcement learning (RL) algorithms in post-training. Group Relative
Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone
to gradient vanishing when reward variance is low, which weakens optimization
signals and impairs convergence. This work makes three contributions: (1) We
propose Variance-Aware Sampling (VAS), a data selection strategy guided by
Variance Promotion Score (VPS) that combines outcome variance and trajectory
diversity to promote reward variance and stabilize policy optimization. (2) We
release large-scale, carefully curated resources containing ~1.6M long CoT
cold-start examples and ~15k RL QA pairs, designed to ensure quality,
difficulty, and diversity, along with a fully reproducible end-to-end training
codebase. (3) We open-source a family of multimodal reasoning models at
multiple scales,
establishing standardized baselines for the community. Experiments across
mathematical reasoning benchmarks demonstrate the effectiveness of both the
curated data and the proposed VAS. Comprehensive ablation studies and analyses
provide further insight into the contributions of each component. In addition,
we theoretically establish that reward variance lower-bounds the expected
policy gradient magnitude, with VAS serving as a practical mechanism to realize
this guarantee. Our code, data, and checkpoints are available at
https://github.com/LengSicong/MMR1.
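
As context for the gradient-vanishing claim above: GRPO normalizes each rollout's reward by the statistics of its group, so if every rollout for a prompt receives the same reward, all group-relative advantages are zero and that prompt contributes no gradient. The display below uses the commonly cited form of the GRPO advantage and gradient estimate; the notation (group size G, rewards r_i, responses o_i to query q, stabilizer epsilon) is illustrative and may differ slightly from the paper's exact statement.

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G) + \epsilon},
\qquad
\nabla_\theta \mathcal{J}(\theta) \approx \frac{1}{G}\sum_{i=1}^{G} \hat{A}_i\, \nabla_\theta \log \pi_\theta(o_i \mid q),
\]

so when the group rewards have (near-)zero variance, every \(\hat{A}_i \approx 0\) and the update signal for that prompt vanishes. This is the degenerate regime that Variance-Aware Sampling is designed to avoid by preferring prompts with high expected reward variance.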
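The abstract describes VAS only at a high level. The sketch below is a hypothetical illustration of variance-aware sampling, not the released implementation: it scores each prompt with a VPS-like mixture of outcome-reward variance and a simple trajectory-diversity proxy (mean pairwise Jaccard distance over response token sets), then samples prompts in proportion to that score. The function names, the Jaccard proxy, and the mixing weight lam are assumptions made for illustration.

import numpy as np

def outcome_variance(rewards):
    # Variance of per-rollout rewards (e.g. 0/1 correctness) for one prompt.
    # For binary rewards this equals p * (1 - p), maximized when accuracy is ~0.5.
    return float(np.var(rewards))

def trajectory_diversity(token_sets):
    # Toy diversity proxy: mean pairwise Jaccard distance between rollout token sets.
    # The paper's actual diversity measure may differ; this is only illustrative.
    n = len(token_sets)
    if n < 2:
        return 0.0
    dists = []
    for i in range(n):
        for j in range(i + 1, n):
            a, b = token_sets[i], token_sets[j]
            union = len(a | b)
            dists.append(1.0 - (len(a & b) / union if union else 1.0))
    return float(np.mean(dists))

def variance_promotion_score(rewards, token_sets, lam=0.5):
    # Hypothetical VPS: convex combination of outcome variance and trajectory diversity.
    return lam * outcome_variance(rewards) + (1.0 - lam) * trajectory_diversity(token_sets)

def variance_aware_sample(prompts, vps_scores, k, seed=0):
    # Draw k prompts with probability proportional to VPS, so prompts whose rollouts
    # all succeed or all fail (near-zero GRPO advantage) are sampled less often.
    rng = np.random.default_rng(seed)
    p = np.asarray(vps_scores, dtype=float)
    p = p / p.sum() if p.sum() > 0 else np.full(len(prompts), 1.0 / len(prompts))
    idx = rng.choice(len(prompts), size=k, replace=False, p=p)
    return [prompts[i] for i in idx]

In such a loop, each candidate prompt would be rolled out G times with the current policy, its rewards and response token sets scored with variance_promotion_score, and the resulting scores passed to variance_aware_sample when building each training batch.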