JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Abstract

Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has focused primarily on action post-training and has often neglected improvements to the foundation model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This refinement improves the models' world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on over 1,000 different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, we demonstrate that our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.
