VisPlay: Self-Evolving Vision-Language Models from Images

November 19, 2025
Authors: Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, Yonghui Yang
cs.AI

Abstract

Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated labels or task-specific heuristics to define verifiable rewards, both of which are costly and difficult to scale. We introduce VisPlay, a self-evolving RL framework that enables VLMs to autonomously improve their reasoning abilities using large amounts of unlabeled image data. Starting from a single base VLM, VisPlay assigns the model to two interacting roles: an Image-Conditioned Questioner that formulates challenging yet answerable visual questions, and a Multimodal Reasoner that generates silver responses. These roles are jointly trained with Group Relative Policy Optimization (GRPO), which incorporates diversity and difficulty rewards to balance the complexity of generated questions with the quality of the silver answers. VisPlay scales efficiently across two model families. When trained on Qwen2.5-VL and MiMo-VL, VisPlay achieves consistent improvements in visual reasoning, compositional generalization, and hallucination reduction across eight benchmarks, including MM-Vet and MMMU, demonstrating a scalable path toward self-evolving multimodal intelligence. The project page is available at https://bruno686.github.io/VisPlay/
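
To make the training signal in the abstract concrete, the sketch below illustrates GRPO-style group-relative advantages combined with hypothetical difficulty and diversity rewards for questions produced by a Questioner and answered by a Reasoner. The reward shapes, function names, and data structures are illustrative assumptions, not the authors' implementation; only the group-normalized advantage follows the standard GRPO formulation.

```python
# Hedged sketch (not the VisPlay code): GRPO-style group-relative advantages
# plus assumed difficulty/diversity rewards for Questioner rollouts on one image.
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class QuestionRollout:
    question: str
    reasoner_success_rate: float   # fraction of Reasoner samples judged correct
    embedding: list[float]         # question embedding used for the diversity term


def difficulty_reward(success_rate: float) -> float:
    """Assumed shape: peaks for questions that are hard but still answerable
    (intermediate Reasoner success), low for trivial or impossible questions."""
    return 4.0 * success_rate * (1.0 - success_rate)


def diversity_reward(emb: list[float], others: list[list[float]]) -> float:
    """Assumed shape: one minus the maximum cosine similarity to the other
    questions sampled for the same image."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb + 1e-8)

    if not others:
        return 0.0
    return 1.0 - max(cos(emb, o) for o in others)


def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalization: each sample's reward is centered and scaled by the
    group's mean and standard deviation, so no learned value function is needed."""
    mu, sigma = mean(rewards), pstdev(rewards) + 1e-8
    return [(r - mu) / sigma for r in rewards]


# Toy group of Questioner rollouts for a single image.
group = [
    QuestionRollout("What color is the car?", 0.95, [1.0, 0.0]),
    QuestionRollout("How many people are reflected in the window?", 0.45, [0.0, 1.0]),
    QuestionRollout("What is the capital of the moon?", 0.05, [0.7, 0.7]),
]

rewards = []
for i, rollout in enumerate(group):
    others = [r.embedding for j, r in enumerate(group) if j != i]
    rewards.append(
        difficulty_reward(rollout.reasoner_success_rate)
        + diversity_reward(rollout.embedding, others)
    )

print(list(zip([g.question for g in group], grpo_advantages(rewards))))
```

In this toy group, the intermediate-difficulty question receives the largest group-relative advantage, which mirrors the abstract's goal of pushing the Questioner toward challenging yet answerable questions while the diversity term discourages near-duplicate questions within a group.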