HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D

Abstract

Recent progress in single-image 3D generation highlights the importance of multi-view coherency, leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However, novel-view diversity remains underexplored, owing to the ambiguity of converting a 2D image into 3D content, where numerous potential shapes can emerge. We aim to fill this research gap by addressing consistency and diversity simultaneously. Striking a balance between the two is a considerable challenge, however, because of their inherent trade-off. This work introduces HarmonyView, a simple yet effective diffusion sampling technique that decomposes two intricate aspects of single-image 3D generation: consistency and diversity. This decomposition allows a more nuanced exploration of the two dimensions within the sampling process. Moreover, we propose a new evaluation metric based on CLIP image and text encoders to comprehensively assess the diversity of the generated views, which closely aligns with human evaluators' judgments. In experiments, HarmonyView achieves a harmonious balance, demonstrating a win-win scenario in both consistency and diversity.
