JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
March 20, 2025
Authors: Muyao Li, Zihao Wang, Kaichen He, Xiaojian Ma, Yitao Liang
cs.AI
Abstract
Recently, action-based decision-making in open-world environments has gained
significant attention. Visual Language Action (VLA) models, pretrained on
large-scale web datasets, have shown promise in decision-making tasks. However,
previous work has primarily focused on action post-training, often neglecting
enhancements to the foundational model itself. In response, we introduce a
novel approach, Act from Visual Language Post-Training, which refines Visual
Language Models (VLMs) through visual and linguistic guidance in a
self-supervised manner. This enhancement improves the models' capabilities in
world knowledge, visual recognition, and spatial grounding in open-world
environments. Following the above post-training paradigms, we obtain the first
VLA models in Minecraft that can follow human instructions on over 1k different
atomic tasks, including crafting, smelting, cooking, mining, and killing. Our
experiments demonstrate that post-training on non-trajectory tasks leads to a
significant 40% improvement over the best agent baseline on a diverse set of
atomic tasks. Furthermore, we demonstrate that our approach surpasses
traditional imitation learning-based policies in Minecraft, achieving
state-of-the-art performance. We have open-sourced the code, models, and
datasets to foster further research. The project page can be found at
https://craftjarvis.github.io/JarvisVLA.