OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Abstract

We present OmniJARVIS, a novel Vision-Language-Action (VLA) model for instruction-following agents in open-world Minecraft. Prior works either emit textual goals to a separate controller or produce control commands directly. OmniJARVIS takes a different path: it tokenizes multimodal interaction data in a unified way, aiming for both strong reasoning and efficient decision-making. First, we introduce a self-supervised approach that jointly learns a behavior encoder, which produces discretized tokens for behavior trajectories $\tau = \{o_0, a_0, \dots\}$, and an imitation learning (IL) policy decoder conditioned on these tokens. The resulting behavior tokens are added to the vocabulary of pretrained Multimodal Language Models (MLMs). With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc. into unified token sequences and model them with autoregressive transformers. Because the behavior tokens are semantically meaningful, the resulting VLA model, OmniJARVIS, can reason (by producing chains of thought), plan, answer questions, and act (by producing behavior tokens for the IL policy decoder). OmniJARVIS demonstrates strong performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further reveals the key design principles in interaction data formation and unified tokenization, as well as the approach's scaling potential.
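To make the idea of unified tokenization concrete, the following is a minimal Python sketch of how discrete behavior codes might share one vocabulary with text tokens and be packed into a single sequence. It is an illustration under stated assumptions, not the authors' implementation: the toy word-level vocabulary, the codebook size `NUM_BEHAVIOR_CODES`, the `<beh_i>` token names, and the helpers `pack` and `extract_behavior_codes` are all hypothetical, and the behavior codes are supplied by hand rather than produced by a learned encoder.

```python
from typing import List

# Toy word-level text vocabulary (assumption; a real MLM uses a subword tokenizer).
TEXT_VOCAB = ["<pad>", "<bos>", "<eos>", "chop", "a", "tree", "i", "need", "wood"]

# Hypothetical codebook size for the behavior encoder.
NUM_BEHAVIOR_CODES = 8

# Augmented vocabulary: behavior tokens are appended after the text tokens.
VOCAB = TEXT_VOCAB + [f"<beh_{i}>" for i in range(NUM_BEHAVIOR_CODES)]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
BEHAVIOR_OFFSET = len(TEXT_VOCAB)


def encode_text(text: str) -> List[int]:
    """Map whitespace-separated words to ids in the shared vocabulary."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]


def encode_behavior(codes: List[int]) -> List[int]:
    """Shift discrete behavior codes into the id range reserved for them."""
    for code in codes:
        if not 0 <= code < NUM_BEHAVIOR_CODES:
            raise ValueError(f"behavior code out of range: {code}")
    return [BEHAVIOR_OFFSET + code for code in codes]


def pack(instruction: str, thought: str, behavior_codes: List[int]) -> List[int]:
    """Pack instruction, thought, and behavior into one token sequence."""
    return (
        [TOKEN_TO_ID["<bos>"]]
        + encode_text(instruction)
        + encode_text(thought)
        + encode_behavior(behavior_codes)
        + [TOKEN_TO_ID["<eos>"]]
    )


def extract_behavior_codes(token_ids: List[int]) -> List[int]:
    """Recover behavior codes from a sequence, e.g. to pass to a policy decoder."""
    return [i - BEHAVIOR_OFFSET for i in token_ids if i >= BEHAVIOR_OFFSET]


sequence = pack("chop a tree", "i need wood", [3, 0, 7])
print([VOCAB[i] for i in sequence])
print(extract_behavior_codes(sequence))
```

In this sketch a single autoregressive model could be trained on `sequence` as ordinary next-token prediction, and any behavior tokens it generates would be routed to the policy decoder by their id range.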
