Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation
July 26, 2023
Authors: Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, Fan Wang
cs.AI
Abstract
Text-to-3D generation has recently garnered significant attention, fueled by
2D diffusion models trained on billions of image-text pairs. Existing methods
primarily rely on score distillation to leverage the 2D diffusion priors to
supervise the generation of 3D models, e.g., NeRF. However, score distillation
is prone to the view-inconsistency problem, and implicit NeRF modeling can also
produce arbitrary shapes, leading to less realistic and less controllable 3D
generation. In this work, we propose Points-to-3D, a flexible framework that
bridges the gap between sparse yet freely available 3D points and realistic,
shape-controllable 3D generation by distilling knowledge from
both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce
controllable sparse 3D points to guide the text-to-3D generation. Specifically,
we use the sparse point cloud generated from the 3D diffusion model, Point-E,
as the geometric prior, conditioned on a single reference image. To better
utilize the sparse 3D points, we propose an efficient point cloud guidance loss
to adaptively drive the NeRF's geometry to align with the shape of the sparse
3D points. In addition to controlling the geometry, we propose to optimize the
NeRF toward a more view-consistent appearance. Specifically, we perform score
distillation with the publicly available 2D image diffusion model ControlNet,
conditioned on text as well as the depth map of the learned compact geometry.
Qualitative and quantitative comparisons demonstrate that Points-to-3D improves
view consistency and achieves good shape controllability for text-to-3D
generation. Points-to-3D provides users with a new way to improve and control
text-to-3D generation.
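The abstract describes the point cloud guidance loss only at a high level: it should adaptively pull the NeRF's geometry toward the sparse Point-E points. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch of one plausible occupancy-style form, assuming a hypothetical `density_fn` that queries the NeRF's volume density at 3D locations. It rewards high density at the sparse point locations and low density at random free-space samples; the function name, the box bounds, and the BCE-style objective are all assumptions, not the authors' definition.

```python
import numpy as np

def point_guidance_loss(density_fn, points, box_min=-1.0, box_max=1.0,
                        n_free=1024, rng=None):
    """Sketch of a point-cloud guidance loss (hypothetical form).

    density_fn: maps an (N, 3) array of 3D positions to (N,) volume densities.
    points:     (M, 3) sparse point cloud used as the geometric prior.
    Encourages the geometry to be occupied at the sparse points and empty
    at random samples elsewhere in the bounding box. Note: random free-space
    samples can land near true surface points; the paper's "adaptive" loss
    presumably handles such cases, which this sketch does not.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    occ = density_fn(points)                                 # densities at prior points
    free = rng.uniform(box_min, box_max, size=(n_free, 3))   # random free-space samples
    emp = density_fn(free)
    # Map non-negative densities to occupancy probabilities in [0, 1).
    p_occ = 1.0 - np.exp(-np.maximum(occ, 0.0))
    p_emp = 1.0 - np.exp(-np.maximum(emp, 0.0))
    eps = 1e-6
    # BCE-style objective: occupied at the points, empty elsewhere.
    return -(np.log(p_occ + eps).mean() + np.log(1.0 - p_emp + eps).mean())
```

In an actual pipeline this term would be implemented with an autograd framework (e.g., PyTorch) so its gradient can update the NeRF's parameters alongside the score-distillation loss from ControlNet.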