MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Abstract

Large multimodal reasoning models have progressed rapidly, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to vanishing gradients when reward variance is low, which weakens the optimization signal and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by the Variance Promotion Score (VPS), which combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contribution of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
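The vanishing-gradient issue described above can be illustrated with a minimal sketch. It assumes the commonly used GRPO advantage, in which each rollout's reward is standardized against the other rollouts sampled for the same prompt; the function names, the `eps` constant, and the toy reward values are hypothetical and are not taken from the MMR1 codebase. The sketch does not implement VPS or VAS, whose exact definitions are not given in this abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single prompt.

    Assumed form: (r - mean) / (std + eps). `eps` is an illustrative
    constant that avoids division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def reward_variance(rewards):
    """Population variance of the rewards in one group."""
    return statistics.pvariance(rewards)


# Hypothetical binary correctness rewards for four rollouts per prompt.
uniform_group = [1.0, 1.0, 1.0, 1.0]  # every rollout is correct
mixed_group = [1.0, 0.0, 1.0, 0.0]    # half of the rollouts are correct

# Zero reward variance: every advantage is zero, so this prompt
# contributes no policy-gradient signal.
print(reward_variance(uniform_group), group_relative_advantages(uniform_group))

# Non-zero reward variance: advantages are close to +1 and -1.
print(reward_variance(mixed_group), group_relative_advantages(mixed_group))
```

Under these assumptions, a prompt whose rollouts all receive the same reward yields no learning signal, which is the motivation for preferring prompts with higher reward variance during sampling.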
