G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Abstract

Vision-Language Models (VLMs) excel at many direct multimodal tasks but struggle to translate this ability into effective decision-making in interactive, visually rich environments such as games. This "knowing-doing" gap significantly limits their potential as autonomous agents: leading VLMs often perform poorly even on simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, and these models demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models, which incorporate a perception-enhanced cold start prior to RL fine-tuning. The resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models such as Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. The source code, including VLM-Gym and the RL training code, is released at https://github.com/chenllliang/G1 to foster future research on advancing VLMs as capable interactive agents.
