OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

June 27, 2024
Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, Yitao Liang
cs.AI

Abstract

We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for instruction-following agents in open-world Minecraft. Prior works either emit textual goals to separate controllers or produce control commands directly. OmniJARVIS takes a different path, aiming for both strong reasoning and efficient decision-making through unified tokenization of multimodal interaction data. First, we introduce a self-supervised approach to learn a behavior encoder that produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$, together with an imitation learning (IL) policy decoder conditioned on these tokens. These additional behavior tokens are added to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Thanks to the semantically meaningful behavior tokens, the resulting VLA model, OmniJARVIS, can reason (by producing chains of thought), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles in interaction data formation and unified tokenization, as well as the scaling potential of the approach.
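
The vocabulary extension and sequence packing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary size, the codebook size, the toy tokenizer, and the names `Segment`, `behavior_token_id`, and `pack` are all assumptions made for illustration. It only shows the general idea of appending discrete behavior codes after a text vocabulary and interleaving text and behavior segments in a single ID sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sizes; the abstract does not state the real values.
TEXT_VOCAB_SIZE = 1000    # stand-in for a pretrained MLM's text vocabulary
NUM_BEHAVIOR_CODES = 8    # stand-in for the behavior encoder's codebook size


def behavior_token_id(code: int) -> int:
    """Map a discrete behavior code to an ID placed after the text vocabulary."""
    if not 0 <= code < NUM_BEHAVIOR_CODES:
        raise ValueError(f"behavior code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def toy_text_tokenize(text: str) -> List[int]:
    """Toy tokenizer (an assumption): one ID per whitespace-separated word."""
    return [sum(word.encode("utf-8")) % TEXT_VOCAB_SIZE for word in text.split()]


@dataclass
class Segment:
    kind: str                      # "text" or "behavior"
    text: str = ""                 # used when kind == "text"
    codes: Tuple[int, ...] = ()    # used when kind == "behavior"


def pack(segments: List[Segment]) -> List[int]:
    """Flatten interleaved text and behavior segments into one ID sequence."""
    ids: List[int] = []
    for seg in segments:
        if seg.kind == "text":
            ids.extend(toy_text_tokenize(seg.text))
        elif seg.kind == "behavior":
            ids.extend(behavior_token_id(c) for c in seg.codes)
        else:
            raise ValueError(f"unknown segment kind: {seg.kind}")
    return ids


# An instruction followed by two behavior codes (values chosen arbitrarily).
episode = [
    Segment("text", text="chop a tree"),
    Segment("behavior", codes=(3, 5)),
]
packed = pack(episode)
```

In this sketch, the behavior IDs occupy a range disjoint from the text IDs, so a single autoregressive model could in principle predict either kind of token at any position; how the real model lays out its vocabulary is not specified in the abstract.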
