FlashWorld: High-quality 3D Scene Generation within Seconds
October 15, 2025
Authors: Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
cs.AI
Abstract
We propose FlashWorld, a generative model that produces 3D scenes from a
single image or text prompt in seconds, 10-100 times faster than previous
works while delivering superior rendering quality. Our approach shifts from the
conventional multi-view-oriented (MV-oriented) paradigm, which generates
multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach
where the model directly produces 3D Gaussian representations during multi-view
generation. While it ensures 3D consistency, the 3D-oriented approach typically
suffers from poor visual quality. FlashWorld includes a dual-mode pre-training phase
followed by a cross-mode post-training phase, effectively integrating the
strengths of both paradigms. Specifically, leveraging the prior from a video
diffusion model, we first pre-train a dual-mode multi-view diffusion model,
which jointly supports MV-oriented and 3D-oriented generation modes. To bridge
the quality gap in 3D-oriented generation, we further propose cross-mode
post-training distillation, which matches the distribution of the consistent
3D-oriented mode to that of the high-quality MV-oriented mode. This not only
enhances visual quality
while maintaining 3D consistency, but also reduces the required denoising steps
for inference. In addition, we propose a strategy to leverage massive amounts of
single-view images and text prompts during this process to enhance the model's
generalization to out-of-distribution inputs. Extensive experiments demonstrate
the superiority and efficiency of our method.
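
To make the cross-mode distillation idea above concrete, the following is a minimal, hypothetical PyTorch sketch in the spirit of distribution-matching distillation: renderings produced by the fast 3D-oriented (student) mode are pulled toward the distribution modeled by the high-quality MV-oriented (teacher) mode. Every name here (student.generate_gaussians, student.render, teacher_denoise, fake_denoise, the noise schedule) is an illustrative assumption, not the authors' actual code or exact formulation.

import torch
import torch.nn.functional as F

def distribution_matching_loss(rendered, teacher_x0, fake_x0):
    # The gradient of this loss w.r.t. `rendered` is proportional to
    # (fake_x0 - teacher_x0), so gradient descent moves the student's renderings
    # away from its own (fake) distribution and toward the teacher's (real) one,
    # as in DMD-style distillation.
    grad = fake_x0 - teacher_x0
    target = (rendered - grad).detach()          # stop-gradient target
    return 0.5 * F.mse_loss(rendered, target)

def cross_mode_step(student, teacher_denoise, fake_denoise, cond, cameras, optimizer):
    # 3D-oriented mode: the student predicts a 3D Gaussian scene in a few
    # denoising steps and renders it to the target views (differentiable splatting).
    gaussians = student.generate_gaussians(cond, cameras)    # hypothetical API
    rendered = student.render(gaussians, cameras)            # (B, C, H, W) images

    # Perturb the consistent renderings with diffusion noise at a random level.
    sigma = torch.rand(rendered.shape[0], 1, 1, 1, device=rendered.device)
    noisy = rendered + sigma * torch.randn_like(rendered)

    # Denoised (x0) predictions from the frozen MV-oriented teacher and from an
    # auxiliary "fake" denoiser tracking the student's current output distribution.
    with torch.no_grad():
        teacher_x0 = teacher_denoise(noisy, sigma, cond)
        fake_x0 = fake_denoise(noisy, sigma, cond)

    loss = distribution_matching_loss(rendered, teacher_x0, fake_x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()

In such a setup the "fake" denoiser would itself be periodically fine-tuned on the student's renderings, so that fake_x0 - teacher_x0 approximates the gap between the two modes' score functions, and because the student is supervised as a direct mapping rather than through a long denoising chain, inference can use far fewer steps. How FlashWorld actually realizes these details is specified in the paper, not in this sketch.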