
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

May 19, 2025
作者: Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang
cs.AI

Abstract

Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models. G1 incorporates a perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models such as Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. Source code, including VLM-Gym and the RL training pipeline, is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.
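
To make the abstract's design concrete, the sketch below illustrates what a "unified interface with adjustable, compositional difficulty" for multi-game parallel training might look like. This is a minimal illustration only: the names `VisualGameEnv`, `StepResult`, and `collect_rollouts`, and the difficulty-knob scheme, are assumptions for exposition and are not the actual VLM-Gym API; see the released repository for the real interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StepResult:
    """Outcome of one environment step (illustrative, not the VLM-Gym type)."""
    observation: Any  # rendered game frame, e.g. an image array
    reward: float
    done: bool
    info: dict = field(default_factory=dict)

class VisualGameEnv(ABC):
    """Hypothetical unified wrapper that every visual game implements."""

    def __init__(self, difficulty: dict | None = None):
        # Compositional difficulty: independent knobs combined per game,
        # e.g. {"grid_size": 4, "num_colors": 3}. Assumed scheme.
        self.difficulty = difficulty or {}

    @abstractmethod
    def reset(self, seed: int | None = None) -> Any:
        """Start a new episode and return the initial visual observation."""

    @abstractmethod
    def step(self, action: str) -> StepResult:
        """Apply a text action produced by the VLM and return the outcome."""

def collect_rollouts(envs: list[VisualGameEnv], policy, max_steps: int = 64):
    """Sketch of a multi-game rollout loop: one trajectory per environment.

    `policy` is any callable mapping an image observation to a text action;
    in practice the environments would run in parallel worker processes.
    """
    trajectories = []
    for env in envs:
        obs, traj, done = env.reset(), [], False
        for _ in range(max_steps):
            action = policy(obs)          # VLM: image -> text action
            result = env.step(action)
            traj.append((obs, action, result.reward))
            obs, done = result.observation, result.done
            if done:
                break
        trajectories.append(traj)
    return trajectories
```

Under this kind of interface, scaling to new games only requires implementing `reset` and `step`, and RL training can mix games freely because trajectories share one schema; this matches, in spirit, the abstract's claim of scalable multi-game parallel training.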
