G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
May 19, 2025
Authors: Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang
cs.AI
Abstract
Vision-Language Models (VLMs) excel in many direct multimodal tasks but
struggle to translate this prowess into effective decision-making within
interactive, visually rich environments like games. This "knowing-doing" gap
significantly limits their potential as autonomous agents, as leading VLMs
often perform poorly even in simple games. To address this, we introduce VLM-Gym,
a curated reinforcement learning (RL) environment featuring diverse visual
games with unified interfaces and adjustable, compositional difficulty,
specifically designed for scalable multi-game parallel training. Leveraging
VLM-Gym, we train G0 models using pure RL-driven self-evolution, which
demonstrate emergent perception and reasoning patterns. To further mitigate
challenges arising from game diversity, we develop G1 models. G1 incorporates a
perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models
consistently surpass their teacher across all games and outperform leading
proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals
an intriguing finding: perception and reasoning abilities mutually bootstrap
each other throughout the RL training process. Source code, including VLM-Gym
and the RL training pipeline, is released at https://github.com/chenllliang/G1 to foster
future research in advancing VLMs as capable interactive agents.