FlashWorld: High-quality 3D Scene Generation within Seconds
October 15, 2025
Authors: Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
cs.AI
Abstract
We propose FlashWorld, a generative model that produces 3D scenes from a
single image or text prompt within seconds, 10 to 100 times faster than
previous works, while delivering superior rendering quality. Our approach shifts from the
conventional multi-view-oriented (MV-oriented) paradigm, which generates
multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach
where the model directly produces 3D Gaussian representations during multi-view
generation. While it ensures 3D consistency, the 3D-oriented approach typically
suffers from poor visual quality. FlashWorld includes a dual-mode pre-training phase
followed by a cross-mode post-training phase, effectively integrating the
strengths of both paradigms. Specifically, leveraging the prior from a video
diffusion model, we first pre-train a dual-mode multi-view diffusion model,
which jointly supports MV-oriented and 3D-oriented generation modes. To bridge
the quality gap in 3D-oriented generation, we further propose a cross-mode
post-training distillation that matches the distribution of the consistent
3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality
while maintaining 3D consistency, but also reduces the required denoising steps
for inference. In addition, we propose a strategy to leverage massive single-view
images and text prompts during this process to enhance the model's
generalization to out-of-distribution inputs. Extensive experiments demonstrate
the superiority and efficiency of our method.
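
To make the cross-mode idea concrete, below is a minimal PyTorch-style sketch of one distillation step. It assumes the 3D-oriented mode emits Gaussian parameters that a differentiable renderer turns back into images, and it approximates the distribution-matching objective with a simple regression surrogate; all function names, signatures, and mode flags are hypothetical and are not taken from the paper's released code.

```python
# Illustrative sketch only: student/teacher/renderer and the "mode" flags are
# hypothetical placeholders, not the authors' API. The true objective is a
# distribution-matching distillation; a per-pixel regression stands in for it here.
import torch
import torch.nn.functional as F

def cross_mode_distillation_step(student, teacher, renderer, noisy_views, cams, sigma):
    """One post-training step pulling the 3D-oriented mode toward the
    MV-oriented mode's output distribution.

    student     -- dual-mode model run in 3D-oriented mode (trainable); returns Gaussian params
    teacher     -- frozen copy run in MV-oriented mode (high per-view quality); returns images
    renderer    -- differentiable Gaussian-splatting renderer
    noisy_views -- noisy multi-view latents/images, shape (B, V, C, H, W)
    cams        -- camera poses for the V target views
    sigma       -- current noise level of the denoising step
    """
    # 3D-oriented mode: predict one set of 3D Gaussians, then render every
    # target view from that single, consistent representation.
    gaussians = student(noisy_views, cams, sigma, mode="3d")
    rendered = renderer(gaussians, cams)

    # MV-oriented mode: the frozen teacher denoises the views directly,
    # giving sharper per-view images but no guaranteed 3D consistency.
    with torch.no_grad():
        mv_images = teacher(noisy_views, cams, sigma, mode="mv")

    # Surrogate for distribution matching: push the rendered (consistent) views
    # toward the teacher's high-quality views.
    return F.mse_loss(rendered, mv_images)
```

The loss is computed on rendered images rather than on the Gaussian parameters themselves, reflecting the abstract's framing: the 3D-oriented output supplies cross-view consistency, while the MV-oriented mode supplies the high-quality target appearance.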