Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
August 28, 2025
Authors: Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
cs.AI
Abstract
Scaling laws have validated the success and promise of large-data-trained models in creative generation across the text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as far less 3D data is available on the internet than for the aforementioned modalities. Fortunately, abundant videos exist that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information in videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality to 3D asset generation, spanning from dataset to model. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view-level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that commonsense priors from videos significantly facilitate 3D creation. We have open-sourced all resources, including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.