MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
September 25, 2025
Authors: Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
cs.AI
Abstract
Large multimodal reasoning models have achieved rapid progress, but their
advancement is constrained by two major limitations: the absence of open,
large-scale, high-quality long chain-of-thought (CoT) data, and the instability
of reinforcement learning (RL) algorithms in post-training. Group Relative
Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone
to gradient vanishing when reward variance is low, which weakens optimization
signals and impairs convergence. This work makes three contributions: (1) We
propose Variance-Aware Sampling (VAS), a data selection strategy guided by
Variance Promotion Score (VPS) that combines outcome variance and trajectory
diversity to promote reward variance and stabilize policy optimization. (2) We
release large-scale, carefully curated resources containing ~1.6M long-CoT
cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty,
and diversity, along with a fully reproducible end-to-end training codebase.
(3) We open-source a family of multimodal reasoning models at multiple scales,
establishing standardized baselines for the community. Experiments across
mathematical reasoning benchmarks demonstrate the effectiveness of both the
curated data and the proposed VAS. Comprehensive ablation studies and analyses
provide further insight into the contributions of each component. In addition,
we theoretically establish that reward variance lower-bounds the expected
policy gradient magnitude, with VAS serving as a practical mechanism to realize
this guarantee. Our code, data, and checkpoints are available at
https://github.com/LengSicong/MMR1.
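
As a quick illustration of the ideas summarized above, here is a minimal, hypothetical Python sketch (not the released MMR1 implementation). Function names such as `variance_promotion_score` and `variance_aware_sample`, the `alpha` weighting, and the trajectory-diversity proxy are assumptions made for illustration only. The sketch shows why GRPO's group-normalized advantages vanish when rewards within a rollout group have low variance, and how a VPS-style score could be used to prioritize prompts during sampling.

```python
# Hedged sketch: (a) GRPO-style group-relative advantages vanish when reward
# variance in a group is low; (b) a hypothetical Variance Promotion Score (VPS)
# mixes outcome variance with a trajectory-diversity proxy; (c) prompts are
# sampled with probability increasing in VPS. All names/formulas are assumptions.
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r_i - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def variance_promotion_score(rewards, trajectories, alpha=0.5):
    """Hypothetical VPS: weighted sum of outcome variance and the fraction
    of distinct rollouts (a crude trajectory-diversity proxy)."""
    outcome_var = float(np.var(np.asarray(rewards, dtype=float)))
    diversity = len(set(trajectories)) / max(len(trajectories), 1)
    return alpha * outcome_var + (1.0 - alpha) * diversity

def variance_aware_sample(prompts, vps_scores, k, temperature=1.0, seed=0):
    """Sample k prompts with probability proportional to exp(VPS / T)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(vps_scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = rng.choice(len(prompts), size=k, replace=False, p=probs)
    return [prompts[i] for i in idx]

# All-correct (or all-wrong) groups give zero advantages -> no gradient signal.
print(grpo_advantages([1, 1, 1, 1]))   # [0. 0. 0. 0.]
print(grpo_advantages([1, 0, 1, 0]))   # informative, non-zero advantages

# Toy usage: prompts with higher reward variance / more diverse rollouts
# receive higher VPS and are more likely to be selected for RL training.
prompts = ["p1", "p2", "p3"]
rollouts = {"p1": ([1, 1, 1, 1], ["a", "a", "a", "a"]),
            "p2": ([1, 0, 1, 0], ["a", "b", "c", "b"]),
            "p3": ([0, 0, 1, 0], ["a", "b", "a", "c"])}
vps = [variance_promotion_score(r, t) for r, t in rollouts.values()]
print(variance_aware_sample(prompts, vps, k=2))
```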