JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Abstract

Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundation model itself. In response, we introduce a novel approach, Act from Visual Language Post-Training, which refines Visual Language Models (VLMs) through visual and linguistic guidance in a self-supervised manner. This refinement improves the models' world knowledge, visual recognition, and spatial grounding in open-world environments. Following this post-training paradigm, we obtain the first VLA models in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that post-training on non-trajectory tasks leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, our approach surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at https://craftjarvis.github.io/JarvisVLA.
