Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
March 26, 2025
Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
cs.AI
Abstract
We present Free4D, a novel tuning-free framework for 4D scene generation from
a single image. Existing methods either focus on object-level generation,
making scene-level generation infeasible, or rely on large-scale multi-view
video datasets for expensive training, with limited generalization ability due
to the scarcity of 4D scene data. In contrast, our key insight is to distill
pre-trained foundation models for consistent 4D scene representation, which
offers promising advantages such as efficiency and generalizability. 1) To
achieve this, we first animate the input image using image-to-video diffusion
models, followed by 4D geometric structure initialization. 2) To turn this
coarse structure into spatially and temporally consistent multi-view videos,
we design an adaptive guidance mechanism with a point-guided denoising
strategy for spatial consistency and a novel latent replacement strategy for
temporal coherence. 3) To lift these generated observations into a consistent
4D representation, we propose a modulation-based refinement to mitigate
inconsistencies while fully leveraging the generated information. The resulting
4D representation enables real-time, controllable rendering, marking a
significant advancement in single-image-based 4D scene generation.
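
The abstract does not name the image-to-video model used in step 1. As a concrete illustration only, the following minimal sketch animates a single input image with an off-the-shelf image-to-video diffusion pipeline (Stable Video Diffusion via the `diffusers` library); the model choice and parameters are assumptions, not the authors' stated pipeline:

```python
# Hypothetical stand-in for step 1: animate the input image with an
# off-the-shelf image-to-video diffusion model. The specific model and
# settings are illustrative assumptions, not the paper's pipeline.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

image = load_image("input.png")        # the single input image
image = image.resize((1024, 576))      # SVD's expected resolution
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "animated.mp4", fps=7)
# These frames would then seed the 4D geometric structure initialization.
```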
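
Step 2 combines two ingredients inside the sampling loop: point-guided denoising, which steers each view toward renders of the coarse 4D structure for spatial consistency, and latent replacement, which injects latents from an already-generated reference video for temporal coherence. Below is a minimal sketch of how such a loop could be wired, assuming a DDIM-style sampler; `denoiser`, `point_render`, the blend weight, and the decay schedule are illustrative stand-ins, not the paper's exact formulation:

```python
import torch

def ddim_step(z, eps, alpha, alpha_prev):
    # One deterministic DDIM update between cumulative alphas.
    x0 = (z - (1 - alpha).sqrt() * eps) / alpha.sqrt()
    return alpha_prev.sqrt() * x0 + (1 - alpha_prev).sqrt() * eps

def sample_view(denoiser, z, point_render, anchor_latents, alphas,
                guide_scale=0.5, replace_until=0.6):
    """Denoise latent `z` for one novel view of one frame (sketch)."""
    steps = len(alphas) - 1
    for i in range(steps):
        frac = i / max(steps - 1, 1)
        # Latent replacement: blend in the reference video's latent during
        # early steps so all views share low-frequency temporal content.
        if frac < replace_until:
            z = torch.lerp(z, anchor_latents[i], 0.5)
        eps = denoiser(z, i)
        # Point guidance, decayed adaptively along the trajectory: the
        # gradient of 0.5 * ||z - point_render||^2 nudges the sample
        # toward the coarse point-cloud render.
        eps = eps + guide_scale * (1.0 - frac) * (z - point_render)
        z = ddim_step(z, eps, alphas[i], alphas[i + 1])
    return z

# Toy usage with random stand-ins for the real video-diffusion components.
denoiser = lambda z, t: torch.randn_like(z)
z = torch.randn(1, 4, 32, 32)
alphas = torch.linspace(0.02, 0.99, 26)          # noisy -> clean
anchors = [torch.randn_like(z) for _ in range(25)]
point_render = torch.randn_like(z)
out = sample_view(denoiser, z, point_render, anchors, alphas)
```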
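
Step 3 lifts the generated videos into a single 4D representation while tolerating residual inconsistencies. One way to realize modulation-based refinement is to give each training view a small learnable appearance modulation that absorbs per-view discrepancies, so the shared representation stays consistent. The sketch below demonstrates the idea on a toy stand-in for the renderer; in the actual method the shared representation would be a 4D Gaussian-splatting scene rather than learnable images:

```python
# Hedged sketch of step 3: per-(view, frame) modulation absorbs residual
# inconsistencies while a shared representation is optimized. A learnable
# image per (view, frame) replaces the 4D renderer purely to keep the
# sketch self-contained.
import torch

num_views, num_frames, H, W = 4, 8, 32, 32
# Stand-in for the shared 4D representation; the real method would
# render these views from an optimized 4D Gaussian scene.
scene = torch.nn.Parameter(torch.rand(num_views, num_frames, 3, H, W))
# Per-(view, frame) channel-wise scale and shift that soak up
# exposure/color drift between the generated videos.
scale = torch.nn.Parameter(torch.ones(num_views, num_frames, 3, 1, 1))
shift = torch.nn.Parameter(torch.zeros(num_views, num_frames, 3, 1, 1))

targets = torch.rand(num_views, num_frames, 3, H, W)  # generated videos
opt = torch.optim.Adam([scene, scale, shift], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    rendered = scene                       # render(scene, view, t) in reality
    modulated = scale * rendered + shift   # modulation applied post-render
    loss = torch.nn.functional.mse_loss(modulated, targets)
    # Keep modulations small so the shared scene explains most content.
    loss = loss + 1e-3 * ((scale - 1).pow(2).mean() + shift.pow(2).mean())
    loss.backward()
    opt.step()
# At render time the modulations are dropped, so output comes from the
# consistent shared representation alone.
```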