MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Abstract

Self-evolution has emerged as a key paradigm for improving foundation models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. Recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data. VLMs, however, introduce an additional visual modality, so they typically require at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
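As a rough illustration of the "group relative" part of GRPO mentioned above, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own sampling group. This is only the generic advantage normalization commonly associated with GRPO, not MM-Zero's reward design. The function name, the `eps` term, the choice of population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per response
    sampled for the same prompt. `eps` (an assumption) avoids division
    by zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses to one prompt (e.g., 1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's average receive a positive advantage and those below receive a negative one, so no separately learned value function is needed to form the baseline.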
