4DNeX: Feed-Forward 4D Generative Modeling Made Easy
August 18, 2025
Authors: Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu
cs.AI
Abstract
We present 4DNeX, the first feed-forward framework for generating 4D (i.e.,
dynamic 3D) scene representations from a single image. In contrast to existing
methods that rely on computationally intensive optimization or require
multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D
generation by fine-tuning a pretrained video diffusion model. Specifically, 1)
to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale
dataset with high-quality 4D annotations generated using advanced
reconstruction approaches. 2) We introduce a unified 6D video representation
that jointly models RGB and XYZ sequences, facilitating structured learning of
both appearance and geometry. 3) We propose a set of simple yet effective
adaptation strategies to repurpose pretrained video diffusion models for 4D
modeling. 4DNeX produces high-quality dynamic point clouds that enable
novel-view video synthesis. Extensive experiments demonstrate that 4DNeX
outperforms existing 4D generation methods in efficiency and generalizability,
offering a scalable solution for image-to-4D modeling and laying the foundation
for generative 4D world models that simulate dynamic scene evolution.
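To make point 2) concrete, here is a minimal sketch of what a unified 6D video representation could look like: each frame pairs an RGB image with a per-pixel XYZ pointmap, concatenated along the channel axis so a video diffusion model can denoise appearance and geometry jointly. The `make_6d_video` helper, the tensor shapes, and the channel layout are illustrative assumptions, not the authors' released code.

```python
# Sketch (assumed, not the paper's implementation) of assembling a 6D video:
# RGB frames plus per-frame XYZ pointmaps, stacked channel-wise.
import torch

def make_6d_video(rgb: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """Stack RGB and XYZ sequences into one 6-channel video.

    rgb: (T, 3, H, W) frames, values in [0, 1]
    xyz: (T, 3, H, W) per-pixel world coordinates (one pointmap per frame)
    returns: (T, 6, H, W) tensor modeling appearance and geometry jointly
    """
    assert rgb.shape == xyz.shape and rgb.shape[1] == 3
    return torch.cat([rgb, xyz], dim=1)

# Example: a 16-frame clip at 256x256 resolution.
video_6d = make_6d_video(torch.rand(16, 3, 256, 256),
                         torch.randn(16, 3, 256, 256))
print(video_6d.shape)  # torch.Size([16, 6, 256, 256])
```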
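The abstract also states that the generated dynamic point clouds enable novel-view video synthesis. As a hedged illustration of that step (the paper does not specify this renderer), the sketch below splats one time step of a colored point cloud through a pinhole camera using far-to-near painter's-algorithm overwriting; `render_frame` and its signature are hypothetical.

```python
# Illustrative novel-view rendering of a dynamic point cloud frame;
# an assumed pinhole projection, not the paper's rendering pipeline.
import torch

def render_frame(xyz, rgb, K, R, t, H, W):
    """Splat one time step of a colored point cloud into a novel view.

    xyz: (N, 3) world-space points; rgb: (N, 3) colors in [0, 1]
    K: (3, 3) pinhole intrinsics; R: (3, 3), t: (3,) world-to-camera pose
    Returns an (H, W, 3) image; pixels hit by no point stay black.
    """
    cam = xyz @ R.T + t                        # world -> camera coordinates
    z = cam[:, 2]
    keep = z > 1e-6                            # drop points behind the camera
    cam, rgb, z = cam[keep], rgb[keep], z[keep]
    uv = (cam @ K.T)[:, :2] / z[:, None]       # perspective projection
    u = uv[:, 0].round().long()
    v = uv[:, 1].round().long()
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z, rgb = u[inside], v[inside], z[inside], rgb[inside]
    img = torch.zeros(H, W, 3)
    order = torch.argsort(z, descending=True)  # far-to-near: near overwrites
    img[v[order], u[order]] = rgb[order]
    return img

# Example: render a random cloud from an identity-pose camera.
img = render_frame(torch.randn(1000, 3) + torch.tensor([0., 0., 3.]),
                   torch.rand(1000, 3),
                   torch.tensor([[128., 0., 128.],
                                 [0., 128., 128.],
                                 [0., 0., 1.]]),
                   torch.eye(3), torch.zeros(3), 256, 256)
```

Repeating this projection per frame and per target camera yields the novel-view videos the abstract describes; a production renderer would typically use proper z-buffering or differentiable splatting instead of simple overwriting.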