Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Abstract

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification. This leads to extremely high training costs and constrains the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework that enables VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero has three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, in which the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their own training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between self-play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
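
The alternation between self-play and RLVR described for Iterative-SPO can be illustrated with a minimal sketch. The abstract does not state how or when the algorithm switches phases, so the fixed-length schedule below is purely an assumption, and the names `alternate_phases`, `phase_length`, `self_play_step`, and `rlvr_step` are hypothetical placeholders rather than anything taken from the released code.

```python
from typing import Callable, List


def alternate_phases(
    total_steps: int,
    phase_length: int,
    self_play_step: Callable[[int], None],
    rlvr_step: Callable[[int], None],
) -> List[str]:
    """Run total_steps updates, switching phase every phase_length steps.

    ASSUMPTION: a fixed-length schedule. The abstract only says the two
    phases alternate; the real switching rule is not given there.
    """
    if phase_length <= 0:
        raise ValueError("phase_length must be positive")

    schedule: List[str] = []
    for step in range(total_steps):
        # Even-numbered blocks run self-play, odd-numbered blocks run RLVR.
        if (step // phase_length) % 2 == 0:
            self_play_step(step)
            schedule.append("self_play")
        else:
            rlvr_step(step)
            schedule.append("rlvr")
    return schedule


if __name__ == "__main__":
    # Hypothetical stand-ins for the two kinds of update.
    log: List[str] = []
    alternate_phases(
        total_steps=6,
        phase_length=2,
        self_play_step=lambda s: log.append(f"step {s}: self-play update"),
        rlvr_step=lambda s: log.append(f"step {s}: RLVR update"),
    )
    print("\n".join(log))
```

In this sketch the two callables stand in for whatever update each phase performs; only the interleaving itself is shown.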
