

ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

October 7, 2025
作者: Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, Chaoyang Wang
cs.AI

Abstract
Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.
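As an illustration of component (iii), the sketch below shows one way noise sharing across frames can be set up: a single base noise vector is sampled once and reused as the initial latent for every frame, so per-frame generation starts from identical noise. This is a minimal, hypothetical sketch, not the paper's implementation; the function name, the use of flat vectors instead of learned 3D latents, and the plain-Python sampling are all assumptions for illustration.

```python
import random


def shared_noise_latents(num_frames: int, latent_dim: int, seed: int = 0):
    """Sample one Gaussian noise vector and reuse it for all frames.

    Hypothetical sketch of cross-frame noise sharing: every frame's
    initial latent is an independent copy of the same base noise, so
    downstream per-frame denoising starts from a common point, which
    is the intuition behind improved temporal stability.
    """
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Copy per frame so later per-frame updates do not alias each other.
    return [list(base) for _ in range(num_frames)]
```

In contrast, sampling fresh noise per frame would give each frame an unrelated starting point, which tends to produce flicker when frames are generated independently.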