MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

Abstract

We present MoVieS, a novel feed-forward model that synthesizes 4D dynamic novel views from monocular videos in one second. MoVieS represents dynamic 3D scenes as pixel-aligned grids of Gaussian primitives and explicitly supervises their time-varying motion. This allows, for the first time, unified modeling of appearance, geometry, and motion, and enables view synthesis, reconstruction, and 3D point tracking within a single learning-based framework. By bridging novel view synthesis and dynamic geometry reconstruction, MoVieS can be trained at large scale on diverse datasets with minimal dependence on task-specific supervision. As a result, it also naturally supports a wide range of zero-shot applications, such as scene flow estimation and moving object segmentation. Extensive experiments validate the effectiveness and efficiency of MoVieS across multiple tasks: it achieves competitive performance while running several orders of magnitude faster.
