ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Abstract

Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Building on large-scale pre-trained 3D models, our framework introduces three key components: (i) a temporal attention mechanism that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
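
As a rough illustration of component (iii), the following minimal sketch contrasts noise shared across frames with independently sampled per-frame noise. It is not the authors' implementation: the function names, the tensor layout (frames × latent tokens × channels), and the use of plain Python lists are all assumptions made for the example, since the abstract does not specify how the noise is drawn or shaped.

```python
import random


def sample_shared_noise(num_frames, num_tokens, dim, seed=0):
    """Draw one Gaussian noise tensor and reuse it for every frame.

    Hypothetical helper: the (frames, tokens, channels) layout is an
    assumption, not something stated in the abstract.
    """
    rng = random.Random(seed)
    base = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    # Copy the base noise per frame so that later in-place updates to one
    # frame's latent do not silently modify the others.
    return [[row[:] for row in base] for _ in range(num_frames)]


def sample_independent_noise(num_frames, num_tokens, dim, seed=0):
    """Baseline for comparison: every frame gets its own noise sample."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
        for _ in range(num_frames)
    ]


shared = sample_shared_noise(num_frames=4, num_tokens=8, dim=16)
independent = sample_independent_noise(num_frames=4, num_tokens=8, dim=16)
```

With shared noise, every frame starts denoising from the same initial latent, so differences between frames come only from the conditioning, which is one plausible reading of why the abstract credits noise sharing with improved temporal stability.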
