Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
September 29, 2025
Authors: Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
cs.AI
Abstract
Although reinforcement learning (RL) can effectively enhance the reasoning
capabilities of vision-language models (VLMs), current methods remain heavily
dependent on labor-intensive datasets that require extensive manual
construction and verification, leading to extremely high training costs and
consequently constraining the practical deployment of VLMs. To address this
challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM
self-improvement through competitive visual games generated from arbitrary
image pairs. Specifically, Vision-Zero encompasses three main attributes: (1)
Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the
Spy"-style games, where the models engage in strategic reasoning and actions
across multiple roles. Through interactive gameplay, models autonomously
generate their training data without human annotation. (2) Gameplay from
Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate
games from arbitrary images, thereby enhancing the model's reasoning ability
across diverse domains and showing strong generalization to different tasks. We
demonstrate this versatility using three distinct types of image datasets:
CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable
Performance Gain: We introduce Iterative Self-Play Policy Optimization
(Iterative-SPO), a novel training algorithm that alternates between Self-Play
and reinforcement learning with verifiable rewards (RLVR), mitigating the
performance plateau often seen in self-play-only training and achieving
sustained long-term improvements. Despite using label-free data, Vision-Zero
achieves state-of-the-art performance on reasoning, chart question answering,
and vision-centric understanding tasks, surpassing other annotation-based
methods. Models and code have been released at
https://github.com/wangqinsi1/Vision-Zero.
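To make the Iterative-SPO alternation concrete, the following is a minimal, hypothetical sketch: a self-play phase generates label-free trajectories from "Who Is the Spy"-style games built on image pairs, then an RLVR phase re-scores the same data with a programmatically checkable reward. The function names, trajectory format, and reward conventions (play_selfplay_game, verifiable_reward, update_policy, game outcomes as self-play rewards) are illustrative assumptions, not the released Vision-Zero API.

```python
from typing import Callable, Sequence, Tuple

def iterative_spo(
    model,                                          # the VLM policy being trained
    image_pairs: Sequence[Tuple[object, object]],   # arbitrary image pairs used to build games
    play_selfplay_game: Callable,                   # runs one "Who Is the Spy"-style game -> trajectory dict (assumed)
    verifiable_reward: Callable,                    # programmatically checkable reward for a trajectory (assumed)
    update_policy: Callable,                        # one policy update step on (trajectories, rewards) (assumed)
    num_rounds: int = 10,
    games_per_round: int = 64,
):
    """Alternate a self-play phase with an RLVR phase, round by round."""
    for _ in range(num_rounds):
        # Phase 1: self-play. The model plays every role; its own interactions
        # become label-free training trajectories.
        trajectories = [
            play_selfplay_game(model, pair)
            for pair in image_pairs[:games_per_round]
        ]
        # Assumed convention: each trajectory dict records its game outcome,
        # which serves as the self-play reward signal.
        update_policy(model, trajectories, [t["outcome"] for t in trajectories])

        # Phase 2: RLVR. Re-score the same trajectories with a verifiable reward
        # (e.g., whether the differing image/object was correctly identified),
        # mitigating the plateau seen in self-play-only training.
        update_policy(model, trajectories, [verifiable_reward(t) for t in trajectories])
    return model
```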