HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D

Abstract

Recent progress in single-image 3D generation highlights the importance of multi-view coherency, leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However, novel-view diversity, which arises from the ambiguity of converting a 2D image into 3D content where numerous shapes are plausible, remains underexplored. Here, we aim to close this research gap by addressing consistency and diversity simultaneously. Striking a balance between the two poses a considerable challenge because of their inherent trade-off. This work introduces HarmonyView, a simple yet effective diffusion sampling technique that decomposes two intricate aspects of single-image 3D generation: consistency and diversity. This decomposition allows a more nuanced exploration of the two dimensions within the sampling process. Moreover, we propose a new evaluation metric based on CLIP image and text encoders to comprehensively assess the diversity of the generated views, and it closely aligns with human evaluators' judgments. In experiments, HarmonyView achieves a harmonious balance, demonstrating a win-win scenario in both consistency and diversity.

