OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
June 27, 2024
Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
cs.AI
Abstract
We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for
instruction-following agents in open-world Minecraft. Compared to
prior works that either emit textual goals to separate controllers or produce
control commands directly, OmniJARVIS seeks a different path to ensure both
strong reasoning and efficient decision-making capabilities via unified
tokenization of multimodal interaction data. First, we introduce a
self-supervised approach to learn a behavior encoder that produces discretized
tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$ and an
imitation learning (IL) policy decoder conditioned on these tokens. These
additional behavior tokens are then added to the vocabulary of pretrained
Multimodal Language Models (MLMs). With this encoder, we then pack long-term
multimodal interactions involving task instructions, memories, thoughts,
observations, textual responses, behavior trajectories, etc. into unified token
sequences and model them with autoregressive transformers. Thanks to the
semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS,
can reason (by producing chains of thought), plan, answer questions, and act
(by producing behavior tokens for the IL policy decoder). OmniJARVIS
demonstrates excellent performance on a comprehensive collection of atomic,
programmatic, and open-ended tasks in open-world Minecraft. Our analysis
further unveils the crucial design principles in interaction data formation,
unified tokenization, and its scaling potential.
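
The unified tokenization described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the `<behavior_k>` token names, the word-level text vocabulary, the codebook size, and both helper functions are hypothetical, and visual observations are left out for brevity. It only shows the two ideas the abstract names: adding discrete behavior tokens to an existing vocabulary, and packing an instruction, a thought, and a behavior segment into one token sequence.

```python
from typing import Dict, List


def extend_vocab(text_vocab: Dict[str, int], num_codes: int) -> Dict[str, int]:
    """Return a copy of the text vocabulary with one new entry per behavior code.

    The "<behavior_k>" naming is an assumption made for this sketch.
    """
    vocab = dict(text_vocab)
    for code in range(num_codes):
        vocab[f"<behavior_{code}>"] = len(vocab)
    return vocab


def pack_interaction(
    vocab: Dict[str, int],
    instruction: str,
    thought: str,
    behavior_codes: List[int],
) -> List[int]:
    """Flatten one interaction step into a single list of token ids.

    Text is split on whitespace (a stand-in for a real tokenizer), and the
    behavior codes are assumed to come from a separately trained encoder.
    """
    ids = [vocab[word] for word in instruction.split()]
    ids += [vocab[word] for word in thought.split()]
    ids += [vocab[f"<behavior_{code}>"] for code in behavior_codes]
    return ids


# Toy usage: a four-word text vocabulary extended with four behavior codes.
text_vocab = {"chop": 0, "a": 1, "tree": 2, "find": 3}
vocab = extend_vocab(text_vocab, num_codes=4)
sequence = pack_interaction(vocab, "chop a tree", "find a tree", [2, 0])
print(sequence)  # [0, 1, 2, 3, 1, 2, 6, 4]
```

In this sketch an autoregressive model trained on such sequences would predict text ids and behavior ids from the same output vocabulary; the predicted behavior ids would then be handed to a policy decoder, which is outside the scope of the example.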