Video World Models with Long-term Spatial Memory

Abstract

Emerging world models autoregressively generate video frames in response to control signals such as camera movements and text prompts. Because their temporal context windows are limited in size, these models often struggle to maintain scene consistency when revisiting a location, and they severely forget previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing the long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory, and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

