MoVerse: Real-Time Video World Modeling with a Panoramic Gaussian Scaffold

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.
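
To make the panorama-to-3D lifting step more concrete, the following is a minimal sketch of the standard equirectangular geometry that such a step could build on. It is not the MoVerse implementation, and the paper's residual prediction is not reproduced here. The function names `pixel_to_ray` and `lift_pixel`, the axis convention (+y up, longitude 0 facing +z), and the per-pixel radial distance are all assumptions made for illustration.

```python
import math


def pixel_to_ray(col, row, width, height):
    """Unit viewing direction for a pixel of an equirectangular panorama.

    Assumed convention (hypothetical, not taken from the paper): the panorama
    is gravity-aligned with +y pointing up, longitude 0 looks along +z, and
    pixel centers sit at half-integer offsets.
    """
    # Longitude runs from -pi to pi across columns.
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    # Latitude runs from +pi/2 at the top edge to -pi/2 at the bottom edge.
    lat = (0.5 - (row + 0.5) / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def lift_pixel(col, row, width, height, distance):
    """Place a 3D point at `distance` along the pixel's ray.

    `distance` stands in for a predicted radial depth; how MoVerse obtains
    or refines it is outside the scope of this sketch.
    """
    dx, dy, dz = pixel_to_ray(col, row, width, height)
    return (dx * distance, dy * distance, dz * distance)


# Example: lift the top-left pixel of an 8x4 panorama to a point 2 units away.
point = lift_pixel(0, 0, 8, 4, 2.0)
```

Points obtained this way could serve as initial centers for per-pixel Gaussians. Because every pixel of a full 360° panorama maps to a direction, the resulting point set surrounds the camera rather than covering only the original narrow field of view.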