G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning

Abstract

Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games. This "knowing-doing" gap significantly limits their potential as autonomous agents, as leading VLMs often perform poorly even in simple games. To address this, we introduce VLM-Gym, a curated reinforcement learning (RL) environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty, specifically designed for scalable multi-game parallel training. Leveraging VLM-Gym, we train G0 models using pure RL-driven self-evolution, and these models demonstrate emergent perception and reasoning patterns. To further mitigate challenges arising from game diversity, we develop G1 models, which incorporate a perception-enhanced cold start prior to RL fine-tuning. The resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking. Systematic analysis reveals an intriguing finding: perception and reasoning abilities mutually bootstrap each other throughout the RL training process. The source code, including VLM-Gym and RL training, is released at https://github.com/chenllliang/G1 to foster future research in advancing VLMs as capable interactive agents.

