Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
September 29, 2025
Authors: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
cs.AI
Abstract
Although reinforcement learning (RL) can effectively enhance the reasoning
capabilities of vision-language models (VLMs), current methods remain heavily
dependent on labor-intensive datasets that require extensive manual
construction and verification, leading to extremely high training costs and
consequently constraining the practical deployment of VLMs. To address this
challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM
self-improvement through competitive visual games generated from arbitrary
image pairs. Specifically, Vision-Zero encompasses three main attributes: (1)
Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the
Spy"-style games, where the models engage in strategic reasoning and actions
across multiple roles. Through interactive gameplay, models autonomously
generate their training data without human annotation. (2) Gameplay from
Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate
games from arbitrary images, thereby enhancing the model's reasoning ability
across diverse domains and showing strong generalization to different tasks. We
demonstrate this versatility using three distinct types of image datasets:
CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable
Performance Gain: We introduce Iterative Self-Play Policy Optimization
(Iterative-SPO), a novel training algorithm that alternates between Self-Play
and reinforcement learning with verifiable rewards (RLVR), mitigating the
performance plateau often seen in self-play-only training and achieving
sustained long-term improvements. Despite using label-free data, Vision-Zero
achieves state-of-the-art performance on reasoning, chart question answering,
and vision-centric understanding tasks, surpassing other annotation-based
methods. Models and code have been released at
https://github.com/wangqinsi1/Vision-Zero.
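To make the Iterative-SPO alternation concrete, the following is a minimal, hypothetical sketch: a self-play phase generates label-free trajectories from "Who Is the Spy"-style games built on image pairs, then an RLVR phase re-scores the same data with a programmatically checkable reward. The function names, trajectory format, and reward conventions (play_selfplay_game, verifiable_reward, update_policy, game outcomes as self-play rewards) are illustrative assumptions, not the released Vision-Zero API.

```python
from typing import Callable, Sequence, Tuple

def iterative_spo(
    model,                                          # the VLM policy being trained
    image_pairs: Sequence[Tuple[object, object]],   # arbitrary image pairs used to build games
    play_selfplay_game: Callable,                   # runs one "Who Is the Spy"-style game -> trajectory dict (assumed)
    verifiable_reward: Callable,                    # programmatically checkable reward for a trajectory (assumed)
    update_policy: Callable,                        # one policy update step on (trajectories, rewards) (assumed)
    num_rounds: int = 10,
    games_per_round: int = 64,
):
    """Alternate a self-play phase with an RLVR phase, round by round."""
    for _ in range(num_rounds):
        # Phase 1: self-play. The model plays every role; its own interactions
        # become label-free training trajectories.
        trajectories = [
            play_selfplay_game(model, pair)
            for pair in image_pairs[:games_per_round]
        ]
        # Assumed convention: each trajectory dict records its game outcome,
        # which serves as the self-play reward signal.
        update_policy(model, trajectories, [t["outcome"] for t in trajectories])

        # Phase 2: RLVR. Re-score the same trajectories with a verifiable reward
        # (e.g., whether the differing image/object was correctly identified),
        # mitigating the plateau seen in self-play-only training.
        update_policy(model, trajectories, [verifiable_reward(t) for t in trajectories])
    return model
```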