Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Abstract

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification. This leads to extremely high training costs and constrains the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework that enables VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero has three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, in which the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their own training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, which enhances the model's reasoning ability across diverse domains and yields strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between self-play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
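To make the game setup more concrete, the following is a minimal Python sketch of one "Who Is the Spy"-style round built from an image pair, together with a toy schedule that alternates between a self-play phase and an RLVR phase. It is an illustration under stated assumptions, not the released implementation: the names `Round`, `make_round`, `vote_rewards`, and `phase_schedule`, the reward values, and the fixed-period switching rule are all hypothetical, and the abstract does not specify how Iterative-SPO decides when to switch phases.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Round:
    views: List[str]  # image shown to each player
    spy: int          # index of the spy, known only to the environment


def make_round(image_pair, n_players=4, rng=None):
    """Deal one round: a single player (the spy) sees the second image."""
    rng = rng or random.Random()
    original, altered = image_pair
    spy = rng.randrange(n_players)
    views = [altered if i == spy else original for i in range(n_players)]
    return Round(views=views, spy=spy)


def vote_rewards(round_, votes):
    """Score votes against the known spy identity; no human label is needed.

    votes[i] is the index of the player that player i accuses.
    Assumed scoring (hypothetical): a non-spy gets +1 for naming the spy
    and -1 otherwise; the spy gets +1 unless it is the plurality suspect.
    """
    top_suspect, _ = Counter(votes).most_common(1)[0]
    rewards = []
    for player, vote in enumerate(votes):
        if player == round_.spy:
            rewards.append(1.0 if top_suspect != round_.spy else -1.0)
        else:
            rewards.append(1.0 if vote == round_.spy else -1.0)
    return rewards


def phase_schedule(n_iters, period=2):
    """Alternate phases every `period` iterations (assumed switching rule)."""
    return ["self_play" if (i // period) % 2 == 0 else "rlvr"
            for i in range(n_iters)]


if __name__ == "__main__":
    rnd = make_round(("scene.png", "scene_edited.png"), rng=random.Random(0))
    print(rnd.views, "spy:", rnd.spy)
    print(phase_schedule(6))
```

Because the environment itself assigns the spy, the outcome of each vote can be checked automatically, which is the sense in which a reward of this kind is verifiable without annotation.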
