MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
April 11, 2025
Authors: Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian
cs.AI
Abstract
World modeling is a crucial task for enabling intelligent agents to
effectively interact with humans and operate in dynamic environments. In this
work, we propose MineWorld, a real-time interactive world model on Minecraft,
an open-ended sandbox game which has been utilized as a common testbed for
world modeling. MineWorld is driven by a visual-action autoregressive
Transformer, which takes paired game scenes and corresponding actions as input,
and generates the subsequent scenes that follow those actions. Specifically, we
transform visual game scenes and actions into discrete token ids with an image
tokenizer and an action tokenizer, respectively, and construct the model input
by interleaving the two kinds of ids. The model is then trained with next-token
prediction to simultaneously learn rich representations of game states and the
conditional relationships between states and actions. In
inference, we develop a novel parallel decoding algorithm that predicts the
spatially redundant tokens in each frame simultaneously, allowing models of
different scales to generate 4 to 7 frames per second and enabling real-time
interaction with game players. In evaluation, we propose new metrics that assess
not only visual quality but also the action-following capacity when generating
new scenes, which is crucial for a world model. Our comprehensive evaluation
shows the efficacy of MineWorld, which significantly outperforms state-of-the-art
open-source diffusion-based world models. The code and model have been released.
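
The interleaved input described in the abstract can be illustrated with a minimal sketch. It assumes each frame and each action has already been converted to a list of integer token ids. The function names, the token counts per frame and per action, and the id values are hypothetical and chosen only for illustration; the actual tokenizers and vocabulary layout of MineWorld are not specified here.

```python
from typing import List, Sequence, Tuple


def interleave_tokens(
    frame_tokens: Sequence[Sequence[int]],
    action_tokens: Sequence[Sequence[int]],
) -> List[int]:
    """Flatten paired (frame, action) token ids into one sequence.

    The result has the layout: frame_0, action_0, frame_1, action_1, ...
    """
    if len(frame_tokens) != len(action_tokens):
        raise ValueError("each frame needs exactly one paired action")
    sequence: List[int] = []
    for frame_ids, action_ids in zip(frame_tokens, action_tokens):
        sequence.extend(frame_ids)   # visual tokens of the current scene
        sequence.extend(action_ids)  # tokens of the action taken in that scene
    return sequence


def next_token_pairs(sequence: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Build (input, target) lists for next-token prediction by shifting by one."""
    return list(sequence[:-1]), list(sequence[1:])


# Toy data (assumption): 4 image tokens per frame, 2 action tokens per step.
frames = [[11, 12, 13, 14], [21, 22, 23, 24]]
actions = [[901, 902], [903, 904]]

sequence = interleave_tokens(frames, actions)
inputs, targets = next_token_pairs(sequence)
```

With this layout, the tokens of each new frame are predicted after the previous frame and its action, which is one way a single next-token objective can cover both state representation and action conditioning.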