MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Abstract

Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to vanishing gradients when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by a Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start examples and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models at multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contribution of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
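
The low-variance failure mode described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO-style advantage, where each reward is standardized within its group of sampled responses, and binary correctness rewards. The function names, the `eps` stabilizer, and the use of the Bernoulli variance p(1 - p) as an "outcome variance" are illustrative assumptions; the abstract does not give the exact VPS definition, so this is not the authors' implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumed GRPO-style form: (r_i - mean) / (std + eps).
    `eps` is a hypothetical stabilizer to avoid division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def outcome_variance(rewards):
    """Variance of binary correctness rewards: p * (1 - p) for pass rate p.

    Illustrative only; not necessarily the paper's VPS component.
    """
    p = statistics.fmean(rewards)
    return p * (1.0 - p)


# A prompt the policy always solves: every advantage is zero,
# so this group contributes no learning signal.
all_correct = [1.0, 1.0, 1.0, 1.0]

# A prompt solved half the time: advantages are non-zero
# and the outcome variance is at its maximum for binary rewards.
mixed = [1.0, 0.0, 1.0, 0.0]

for name, rewards in [("all_correct", all_correct), ("mixed", mixed)]:
    print(name, outcome_variance(rewards), group_relative_advantages(rewards))
```

Under these assumptions, a sampler that favors prompts like `mixed` over prompts like `all_correct` keeps the group advantages away from zero, which is the intuition behind preferring high-variance prompts during RL training.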
