
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

September 29, 2025
作者: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
cs.AI

Abstract

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
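
The core idea of Iterative-SPO, as described in the abstract, is an alternation between self-play data generation on arbitrary image pairs and RLVR-style policy updates on the resulting trajectories. The outline below is a minimal, assumption-laden Python sketch of that loop only; every function name, signature, and the stub reward are hypothetical and do not reflect the released implementation in the repository.

```python
import random

def self_play_phase(policy, image_pairs, num_games):
    """Play "Who Is the Spy"-style games built from arbitrary image pairs and
    collect the resulting trajectories as label-free training data (sketch)."""
    trajectories = []
    for _ in range(num_games):
        pair = random.choice(image_pairs)
        # Hypothetical rollout: the policy produces the game moves for this pair.
        trajectories.append({"images": pair, "moves": policy(pair)})
    return trajectories

def verifiable_reward(trajectory):
    """Stub for a verifiable reward derived from the game outcome."""
    return 1.0 if trajectory["moves"] else 0.0

def rlvr_phase(policy, trajectories):
    """RLVR step on self-play data; the actual gradient update is omitted."""
    for traj in trajectories:
        reward = verifiable_reward(traj)
        # Placeholder for a policy-gradient update (e.g., a PPO/GRPO-style step).
    return policy

def iterative_spo(policy, image_pairs, rounds=10, games_per_round=64):
    """Alternate self-play and RLVR, the pattern the paper credits with
    avoiding the plateau seen in self-play-only training."""
    for _ in range(rounds):
        data = self_play_phase(policy, image_pairs, games_per_round)
        policy = rlvr_phase(policy, data)
    return policy

# Toy usage with stand-in inputs:
trained = iterative_spo(lambda pair: ["clue", "vote"],
                        [("img_a.png", "img_b.png")],
                        rounds=2, games_per_round=4)
```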