Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
March 26, 2025
Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
cs.AI
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from
a single image. Existing methods either focus on object-level generation,
making scene-level generation infeasible, or rely on large-scale multi-view
video datasets for expensive training, with limited generalization ability due
to the scarcity of 4D scene data. In contrast, our key insight is to distill
pre-trained foundation models for consistent 4D scene representation, which
offers promising advantages such as efficiency and generalizability. 1) To
achieve this, we first animate the input image using image-to-video diffusion
models, followed by 4D geometric structure initialization. 2) To turn this
coarse structure into spatially and temporally consistent multi-view videos,
we design an adaptive guidance mechanism with a point-guided denoising
strategy for spatial consistency and a novel latent replacement strategy for
temporal coherence. 3) To lift these generated observations into a consistent
4D representation, we propose a modulation-based refinement to mitigate
inconsistencies while fully leveraging the generated information. The resulting
4D representation enables real-time, controllable rendering, marking a
significant advancement in single-image-based 4D scene generation.
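
The abstract does not name the image-to-video model used in step 1. As a concrete illustration only, the following minimal sketch animates a single input image with an off-the-shelf image-to-video diffusion pipeline (Stable Video Diffusion via the `diffusers` library); the model choice and parameters are assumptions, not the authors' stated pipeline:

```python
# Hypothetical stand-in for step 1: animate the input image with an
# off-the-shelf image-to-video diffusion model. The specific model and
# settings are illustrative assumptions, not the paper's pipeline.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.png")        # the single input image
image = image.resize((1024, 576))      # SVD's expected resolution
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "animated.mp4", fps=7)
# These frames would then seed the 4D geometric structure initialization.
```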
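
Step 2 combines two ingredients inside the sampling loop: point-guided denoising, which steers each view toward renders of the coarse 4D structure for spatial consistency, and latent replacement, which injects latents from an already-generated reference video for temporal coherence. Below is a minimal sketch of how such a loop could be wired, assuming a DDIM-style sampler; `denoiser`, `point_render`, the blend weight, and the decay schedule are illustrative stand-ins, not the paper's exact formulation:

```python
import torch

def ddim_step(z, eps, alpha, alpha_prev):
    # One deterministic DDIM update between cumulative alphas.
    x0 = (z - (1 - alpha).sqrt() * eps) / alpha.sqrt()
    return alpha_prev.sqrt() * x0 + (1 - alpha_prev).sqrt() * eps

def sample_view(denoiser, z, point_render, anchor_latents, alphas,
                guide_scale=0.5, replace_until=0.6):
    """Denoise latent `z` for one novel view of one frame (sketch)."""
    steps = len(alphas) - 1
    for i in range(steps):
        frac = i / max(steps - 1, 1)
        # Latent replacement: blend in the reference video's latent during
        # early steps so all views share low-frequency temporal content.
        if frac < replace_until:
            z = torch.lerp(z, anchor_latents[i], 0.5)
        eps = denoiser(z, i)
        # Point guidance, decayed adaptively along the trajectory: the
        # gradient of 0.5 * ||z - point_render||^2 nudges the sample
        # toward the coarse point-cloud render.
        eps = eps + guide_scale * (1.0 - frac) * (z - point_render)
        z = ddim_step(z, eps, alphas[i], alphas[i + 1])
    return z

# Toy usage with random stand-ins for the real video-diffusion components.
denoiser = lambda z, t: torch.randn_like(z)
z = torch.randn(1, 4, 32, 32)
alphas = torch.linspace(0.02, 0.99, 26)          # noisy -> clean
anchors = [torch.randn_like(z) for _ in range(25)]
point_render = torch.randn_like(z)
out = sample_view(denoiser, z, point_render, anchors, alphas)
```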
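
Step 3 lifts the generated videos into a single 4D representation while tolerating residual inconsistencies. One way to realize modulation-based refinement is to give each training view a small learnable appearance modulation that absorbs per-view discrepancies, so the shared representation stays consistent. The sketch below demonstrates the idea on a toy stand-in for the renderer; in the actual method the shared representation would be a 4D Gaussian-splatting scene rather than learnable images:

```python
# Hedged sketch of step 3: per-(view, frame) modulation absorbs residual
# inconsistencies while a shared representation is optimized. A learnable
# image per (view, frame) replaces the 4D renderer purely to keep the
# sketch self-contained.
import torch

num_views, num_frames, H, W = 4, 8, 32, 32
# Stand-in for the shared 4D representation; the real method would
# render these views from an optimized 4D Gaussian scene.
scene = torch.nn.Parameter(torch.rand(num_views, num_frames, 3, H, W))
# Per-(view, frame) channel-wise scale and shift that soak up
# exposure/color drift between the generated videos.
scale = torch.nn.Parameter(torch.ones(num_views, num_frames, 3, 1, 1))
shift = torch.nn.Parameter(torch.zeros(num_views, num_frames, 3, 1, 1))

targets = torch.rand(num_views, num_frames, 3, H, W)  # generated videos
opt = torch.optim.Adam([scene, scale, shift], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    rendered = scene                       # render(scene, view, t) in reality
    modulated = scale * rendered + shift   # modulation applied post-render
    loss = torch.nn.functional.mse_loss(modulated, targets)
    # Keep modulations small so the shared scene explains most content.
    loss = loss + 1e-3 * ((scale - 1).pow(2).mean() + shift.pow(2).mean())
    loss.backward()
    opt.step()
# At render time the modulations are dropped, so output comes from the
# consistent shared representation alone.
```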