4DNeX: Feed-Forward 4D Generative Modeling Made Easy
August 18, 2025
Authors: Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, Ziwei Liu
cs.AI
Abstract
We present 4DNeX, the first feed-forward framework for generating 4D (i.e.,
dynamic 3D) scene representations from a single image. In contrast to existing
methods that rely on computationally intensive optimization or require
multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D
generation by fine-tuning a pretrained video diffusion model. Specifically:
1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale
dataset with high-quality 4D annotations generated using advanced
reconstruction approaches; 2) we introduce a unified 6D video representation
that jointly models RGB and XYZ sequences, facilitating structured learning of
both appearance and geometry; and 3) we propose a set of simple yet effective
adaptation strategies to repurpose pretrained video diffusion models for 4D
modeling. 4DNeX produces high-quality dynamic point clouds that enable
novel-view video synthesis. Extensive experiments demonstrate that 4DNeX
outperforms existing 4D generation methods in efficiency and generalizability,
offering a scalable solution for image-to-4D modeling and laying the foundation
for generative 4D world models that simulate dynamic scene evolution.
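
To make the representation concrete, below is a minimal, hypothetical Python sketch of what a unified 6D video (per-frame RGB plus per-pixel XYZ) and its interpretation as a dynamic point cloud could look like. The function names, array layout, and channel ordering are illustrative assumptions based on the abstract, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a "unified 6D video" read as
# per-frame RGB images stacked with per-pixel XYZ pointmaps of matching
# resolution. All names and shapes here are hypothetical.
import numpy as np

def make_6d_video(rgb_frames: np.ndarray, xyz_frames: np.ndarray) -> np.ndarray:
    """Stack RGB and XYZ channels into a single (T, H, W, 6) video tensor.

    rgb_frames: (T, H, W, 3) float array in [0, 1].
    xyz_frames: (T, H, W, 3) per-pixel 3D coordinates in a shared world frame.
    """
    assert rgb_frames.shape == xyz_frames.shape
    return np.concatenate([rgb_frames, xyz_frames], axis=-1)

def to_dynamic_point_cloud(video6d: np.ndarray):
    """Interpret each frame of the 6D video as a colored point cloud.

    Returns one (points, colors) pair per frame -- the kind of dynamic
    point cloud output the abstract describes.
    """
    clouds = []
    for frame in video6d:                    # frame: (H, W, 6)
        rgb = frame[..., :3].reshape(-1, 3)  # per-point colors
        xyz = frame[..., 3:].reshape(-1, 3)  # per-point 3D positions
        clouds.append((xyz, rgb))
    return clouds

# Example usage with random data standing in for model outputs.
T, H, W = 8, 64, 64
rgb = np.random.rand(T, H, W, 3).astype(np.float32)
xyz = np.random.randn(T, H, W, 3).astype(np.float32)
video6d = make_6d_video(rgb, xyz)            # (8, 64, 64, 6)
clouds = to_dynamic_point_cloud(video6d)     # 8 frames of (4096,) colored points
```

Under this reading, each pixel carries both a color and a world-space position, so every generated frame unprojects directly into a colored point cloud, which is consistent with the novel-view video synthesis the abstract claims the dynamic point clouds enable.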