JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models
November 10, 2023
Authors: Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, Yitao Liang
cs.AI
Abstract
Achieving human-like planning and control with multimodal observations in an
open world is a key milestone for more functional generalist agents. Existing
approaches can handle certain long-horizon tasks in an open world. However,
they still struggle when the number of open-world tasks could potentially be
infinite and lack the capability to progressively enhance task completion as
game time progresses. We introduce JARVIS-1, an open-world agent that can
perceive multimodal input (visual observations and human instructions),
generate sophisticated plans, and perform embodied control, all within the
popular yet challenging open-world Minecraft universe. Specifically, we develop
JARVIS-1 on top of pre-trained multimodal language models, which map visual
observations and textual instructions to plans. The plans are ultimately
dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a
multimodal memory, which facilitates planning using both pre-trained knowledge
and its actual game survival experiences. In our experiments, JARVIS-1 exhibits
nearly perfect performance across more than 200 varied tasks from the Minecraft
Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has
achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task.
This represents a significant increase of up to 5 times over previous
records. Furthermore, we show that JARVIS-1 is able to self-improve
following a life-long learning paradigm thanks to multimodal memory, sparking a
more general intelligence and improved autonomy. The project page is available
at https://craftjarvis-jarvis1.github.io.
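
The loop the abstract describes (retrieve relevant experience from memory, plan from the observation and instruction, dispatch sub-goals to a controller, then store successful experience) can be illustrated with a toy sketch. Nothing below comes from the JARVIS-1 codebase: `MultimodalMemory`, `toy_planner`, `toy_controller`, and `run_episode` are hypothetical names, text strings stand in for visual observations, word overlap stands in for multimodal retrieval, and the planner and controller are trivial placeholders for the pre-trained language model and the goal-conditioned controllers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str
    observation: str  # assumption: a text stand-in for a visual observation
    plan: list[str]


@dataclass
class MultimodalMemory:
    """Hypothetical memory of past successful (task, observation, plan) triples."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def retrieve(self, task: str, observation: str, k: int = 1) -> list[MemoryEntry]:
        # Toy similarity: number of words shared by the query and the stored key.
        query = set(f"{task} {observation}".split())

        def score(entry: MemoryEntry) -> int:
            return len(query & set(f"{entry.task} {entry.observation}".split()))

        ranked = sorted(self.entries, key=score, reverse=True)
        return [e for e in ranked[:k] if score(e) > 0]

    def store(self, task: str, observation: str, plan: list[str]) -> None:
        self.entries.append(MemoryEntry(task, observation, list(plan)))


def toy_planner(task: str, observation: str, references: list[MemoryEntry]) -> list[str]:
    # Placeholder for the multimodal language model: reuse a retrieved plan
    # if one exists, otherwise fall back to a generic two-step plan.
    if references:
        return list(references[0].plan)
    return [f"explore for {task}", f"complete {task}"]


def toy_controller(sub_goal: str) -> bool:
    # Placeholder for a goal-conditioned controller; always reports success.
    return True


def run_episode(task: str, observation: str, memory: MultimodalMemory) -> tuple[list[str], bool]:
    references = memory.retrieve(task, observation)
    plan = toy_planner(task, observation, references)
    success = all(toy_controller(goal) for goal in plan)
    if success:
        # Successful experience is written back so later episodes can reuse it.
        memory.store(task, observation, plan)
    return plan, success
```

In this sketch, calling `run_episode` repeatedly with the same `MultimodalMemory` instance lets later episodes reuse plans stored by earlier ones, which is the self-improvement idea in miniature.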