Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
August 28, 2025
Authors: Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan
cs.AI
Abstract
Scaling laws have validated the success and promise of models trained on large data for creative generation across the text, image, and video domains. In the 3D domain, however, this paradigm faces data scarcity: far less 3D data is available on the internet than data in the aforementioned modalities. Fortunately, abundant videos exist that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial-consistency prior for 3D generation. On the other hand, the rich semantic information contained in videos enables the generated content to be more faithful to the text prompts and more semantically plausible. This paper explores how to apply the video modality to 3D asset generation, spanning from datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view-level annotations, and train Droplet3D, a generative model supporting both image and dense text inputs. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that commonsense priors from videos significantly facilitate 3D creation. We have open-sourced all resources, including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
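For a concrete picture of the interface the abstract describes (a reference image plus a dense text prompt in, multi-view content out), here is a minimal illustrative sketch. All names in it (Droplet3DRequest, generate_multiview, num_views) are hypothetical placeholders, not the released API; the actual entry points are in the open-sourced code at https://dropletx.github.io/.

    # Illustrative sketch only: the names below are hypothetical
    # placeholders, not the released Droplet3D API.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Droplet3DRequest:
        # Inputs the abstract describes: a dense text prompt,
        # optionally paired with a single reference image.
        dense_prompt: str                 # detailed text description of the asset
        image_path: Optional[str] = None  # optional reference view

    def generate_multiview(request: Droplet3DRequest, num_views: int = 24) -> List[str]:
        # Hypothetical entry point: produce an orbit of spatially consistent
        # views that a downstream reconstruction step would lift into a 3D
        # asset. Placeholder body; the open-sourced model implements this step.
        return [f"view_{i:03d}.png" for i in range(num_views)]

    views = generate_multiview(Droplet3DRequest(
        dense_prompt="A ceramic teapot with a blue glaze, seen from all sides "
                     "under studio lighting.",
        image_path="teapot_ref.png"))
    print(f"{len(views)} views generated")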