WORLDMEM: Long-term Consistent World Simulation with Memory

Abstract

World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
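The memory bank described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes a learned memory attention mechanism inside a scene generator, whereas the sketch below substitutes a fixed softmax over state distance. The names `MemoryUnit`, `MemoryBank`, `write`, `read`, `pose_scale`, and `time_scale` are hypothetical, frames are stand-in feature vectors, and poses are reduced to 3D positions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryUnit:
    """One stored observation: a frame plus the state it was seen in."""
    frame: List[float]                 # stand-in for a frame's feature vector
    pose: Tuple[float, float, float]   # assumed (x, y, z) position; orientation omitted
    timestamp: float


@dataclass
class MemoryBank:
    """Toy memory bank (hypothetical API, not the paper's)."""
    units: List[MemoryUnit] = field(default_factory=list)

    def write(self, frame, pose, timestamp):
        self.units.append(MemoryUnit(list(frame), tuple(pose), float(timestamp)))

    def read(self, pose, timestamp, pose_scale=1.0, time_scale=1.0):
        """Blend stored frames, weighting each by closeness of its state.

        The weights are a softmax over negative state distance. This is a
        fixed heuristic standing in for learned attention.
        """
        if not self.units:
            raise ValueError("memory bank is empty")
        scores = []
        for unit in self.units:
            d_pose = math.dist(unit.pose, pose)
            d_time = abs(unit.timestamp - timestamp)
            scores.append(-(d_pose / pose_scale + d_time / time_scale))
        peak = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.units[0].frame)
        blended = [
            sum(w * u.frame[i] for w, u in zip(weights, self.units))
            for i in range(dim)
        ]
        return blended, weights


bank = MemoryBank()
bank.write([1.0, 0.0], pose=(0.0, 0.0, 0.0), timestamp=0.0)
bank.write([0.0, 1.0], pose=(10.0, 0.0, 0.0), timestamp=5.0)

# Querying from the first pose and time should recall mostly the first frame.
blended, weights = bank.read(pose=(0.0, 0.0, 0.0), timestamp=0.0)
```

Including the timestamp in the distance means that two visits to the same place at different times are retrieved differently, which is the role the abstract assigns to timestamps in modeling a world that changes over time.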
