

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

March 10, 2026
Authors: Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu
cs.AI

Abstract

Self-evolution has emerged as a key paradigm for improving foundation models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
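The abstract names two mechanisms: GRPO's group-relative advantage normalization, and a reward that mixes execution feedback, visual verification, and difficulty balancing. The sketch below illustrates both in minimal form; the function names, weights, and the specific difficulty-balancing shape (peaking at a 50% Solver success rate) are illustrative assumptions, not the authors' actual implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each rollout's reward is normalized
    against the mean and std of its own sampling group, so no
    separate value/critic model is needed."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mu) / sigma for r in rewards]

def combined_reward(executed_ok, visually_valid, solver_success_rate,
                    w_exec=0.3, w_visual=0.3, w_difficulty=0.4):
    """Hypothetical reward mixing the three signals the paper names.
    Difficulty balancing here peaks when the Solver succeeds about
    half the time, favoring problems that are neither trivial nor
    impossible; the weights are illustrative, not from the paper."""
    difficulty = 1.0 - abs(solver_success_rate - 0.5) * 2.0
    return (w_exec * float(executed_ok)
            + w_visual * float(visually_valid)
            + w_difficulty * difficulty)

# Example: a group of 4 Coder rollouts for one proposed concept.
rewards = [
    combined_reward(True, True, 0.5),    # runs, renders, well-calibrated
    combined_reward(True, False, 0.5),   # runs, but fails visual check
    combined_reward(False, False, 0.0),  # code does not execute
    combined_reward(True, True, 0.25),   # valid image, too hard for Solver
]
advantages = group_relative_advantages(rewards)
```

Rollouts above the group mean receive positive advantages and are reinforced; the normalization makes the update scale-free across groups of differing overall quality.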