Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

Abstract

Recently, 3D content creation from text prompts has made remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure strong multi-view consistency, their ability to generate high-quality and diverse 3D assets is hindered by the limited availability of 3D data. In contrast, 2D diffusion models can be distilled into 3D, an approach that achieves excellent generalization and rich details without any 3D data. However, 2D lifting methods suffer from an inherent view-agnostic ambiguity, which leads to serious multi-face Janus issues: text prompts alone fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model, we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper, we propose Sherpa3D, a new text-to-3D framework that achieves high fidelity, generalizability, and geometric consistency simultaneously. Specifically, we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance, the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of Sherpa3D over state-of-the-art text-to-3D methods in terms of quality and 3D consistency.
