MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data
March 10, 2026
Authors: Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu
cs.AI
Abstract
Self-evolution has emerged as a key paradigm for improving foundation models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
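To make the training setup concrete, the sketch below illustrates the two mechanisms the abstract names: a reward that combines execution feedback, visual verification, and difficulty balancing, and GRPO's group-relative advantage normalization. All function names, weights, and the shape of the difficulty term are illustrative assumptions, not details from the paper.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Episode:
    """One rollout of the three-role loop (hypothetical structure)."""
    question: str  # produced by the Proposer
    code: str      # produced by the Coder (e.g., Python/SVG that renders an image)
    answer: str    # produced by the Solver


def combined_reward(executed_ok: bool, visually_valid: bool,
                    solver_correct: bool, solve_rate: float) -> float:
    """Combine the three signals the abstract mentions: execution
    feedback, visual verification, and difficulty balancing.
    Equal weights and the tent-shaped difficulty term (peaking at a
    50% solve rate) are assumptions for illustration only."""
    r_exec = 1.0 if executed_ok else 0.0          # did the Coder's code run?
    r_visual = 1.0 if visually_valid else 0.0     # did it render a valid image?
    r_solver = 1.0 if solver_correct else 0.0     # did the Solver answer correctly?
    # Difficulty balancing: reward tasks that are neither trivial
    # (solve_rate near 1) nor impossible (solve_rate near 0).
    r_difficulty = 1.0 - abs(solve_rate - 0.5) * 2.0
    return 0.25 * (r_exec + r_visual + r_solver + r_difficulty)


def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Standard GRPO advantage: normalize each sampled rollout's reward
    against the mean and standard deviation of its sampling group.
    How the three roles share or split groups is not specified here."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]
```

A task whose generated code runs, renders a valid image, is solved correctly, and sits at a 50% historical solve rate would receive the maximum combined reward of 1.0 under this toy weighting; the group-normalized advantages then sum to zero within each sampling group, which is what makes the policy update relative rather than absolute.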