MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft

Abstract

World modeling is a crucial task for enabling intelligent agents to interact effectively with humans and operate in dynamic environments. In this work, we propose MineWorld, a real-time interactive world model on Minecraft, an open-ended sandbox game that has been widely used as a common testbed for world modeling. MineWorld is driven by a visual-action autoregressive Transformer, which takes paired game scenes and corresponding actions as input and generates the new scenes that follow from those actions. Specifically, we convert visual game scenes and actions into discrete token ids with an image tokenizer and an action tokenizer, respectively, and construct the model input by interleaving the two kinds of ids. The model is then trained with next-token prediction to learn rich representations of game states and, at the same time, the conditional relationship between states and actions. At inference time, we develop a novel parallel decoding algorithm that predicts the spatially redundant tokens in each frame simultaneously, allowing models at different scales to generate 4 to 7 frames per second and enabling real-time interaction with game players. For evaluation, we propose new metrics that assess not only visual quality but also the action-following capacity when generating new scenes, which is crucial for a world model. Our comprehensive evaluation demonstrates the efficacy of MineWorld, which significantly outperforms state-of-the-art open-source diffusion-based world models. The code and model have been released.
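To make the input construction described in the abstract more concrete, the following is a minimal sketch of how image and action token ids could be interleaved into one sequence and shifted for next-token prediction. It is an illustration only, not the released MineWorld code: the function names `interleave_tokens` and `shift_for_next_token`, the toy token counts, and the choice to place action ids after the image vocabulary by adding an offset are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def interleave_tokens(
    frames: Sequence[Sequence[int]],
    actions: Sequence[Sequence[int]],
    image_vocab_size: int,
) -> List[int]:
    """Build one flat sequence: frame_0 ids, action_0 ids, frame_1 ids, action_1 ids, ...

    Assumption for this sketch: action ids share one vocabulary with image ids
    and are shifted by `image_vocab_size` so the two id ranges never collide.
    """
    if len(frames) != len(actions):
        raise ValueError("each frame needs exactly one paired action")
    sequence: List[int] = []
    for frame_ids, action_ids in zip(frames, actions):
        sequence.extend(frame_ids)
        sequence.extend(image_vocab_size + a for a in action_ids)
    return sequence


def shift_for_next_token(sequence: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    return list(sequence[:-1]), list(sequence[1:])


if __name__ == "__main__":
    # Toy data: 2 frames of 4 image tokens each, 2 action tokens per frame (hypothetical sizes).
    frames = [[5, 9, 2, 7], [1, 1, 3, 8]]
    actions = [[0, 2], [1, 0]]
    seq = interleave_tokens(frames, actions, image_vocab_size=16)
    inputs, targets = shift_for_next_token(seq)
    print(seq)
    print(inputs)
    print(targets)
```

With this layout, a single autoregressive model sees each action immediately after the frame it was taken in, so predicting the next frame's tokens is naturally conditioned on both the preceding scenes and the preceding actions.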
