G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
May 19, 2025
Authors: Liang Chen, Hongcheng Gao, Tianyu Liu, Zhiqi Huang, Flood Sung, Xinyu Zhou, Yuxin Wu, Baobao Chang
cs.AI
Abstract
Vision-Language Models (VLMs) excel in many direct multimodal tasks but
struggle to translate this prowess into effective decision-making within
interactive, visually rich environments like games. This "knowing-doing" gap
significantly limits their potential as autonomous agents, as leading VLMs
often perform poorly even in simple games. To address this, we introduce VLM-Gym,
a curated reinforcement learning (RL) environment featuring diverse visual
games with unified interfaces and adjustable, compositional difficulty,
specifically designed for scalable multi-game parallel training. Leveraging
VLM-Gym, we train G0 models using pure RL-driven self-evolution, which
demonstrate emergent perception and reasoning patterns. To further mitigate
challenges arising from game diversity, we develop G1 models. G1 incorporates a
perception-enhanced cold start prior to RL fine-tuning. Our resulting G1 models
consistently surpass their teacher across all games and outperform leading
proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals
an intriguing finding: perception and reasoning abilities mutually bootstrap
each other throughout the RL training process. Source code, including VLM-Gym
and the RL training pipeline, is released at https://github.com/chenllliang/G1 to foster
future research in advancing VLMs as capable interactive agents.