HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
October 12, 2023
Authors: Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov
cs.AI
Abstract
Despite significant advances in large-scale text-to-image models, achieving
hyper-realistic human image generation remains a desirable yet unsolved task.
Existing models like Stable Diffusion and DALL-E 2 tend to generate human
images with incoherent parts or unnatural poses. To tackle these challenges,
our key insight is that human images are inherently structural across multiple
granularities, from the coarse-level body skeleton to fine-grained spatial
geometry. Therefore, capturing such correlations between the explicit
appearance and latent structure in one model is essential to generate coherent
and natural human images. To this end, we propose a unified framework,
HyperHuman, that generates in-the-wild human images of high realism and diverse
layouts. Specifically, 1) we first build a large-scale human-centric dataset,
named HumanVerse, which consists of 340M images with comprehensive annotations
like human pose, depth, and surface normal. 2) Next, we propose a Latent
Structural Diffusion Model that simultaneously denoises the depth and surface
normal along with the synthesized RGB image. Our model enforces the joint
learning of image appearance, spatial relationship, and geometry in a unified
network, where the branches of the model complement each other with both
structural awareness and textural richness. 3) Finally, to further boost the
visual quality, we propose a Structure-Guided Refiner that composes the predicted
conditions for more detailed, higher-resolution generation. Extensive
experiments demonstrate that our framework yields state-of-the-art
performance, generating hyper-realistic human images under diverse scenarios.
Project Page: https://snap-research.github.io/HyperHuman/
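
The joint denoising idea in point 2 can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the flat-list "latents", the noise-prediction (epsilon) target, and the choice of one shared noise level for all three modalities are assumptions made purely for illustration. It only shows what it means for a single network call to be scored on RGB, depth, and surface normal at once.

```python
import math
import random

# The three targets named in the abstract.
MODALITIES = ("rgb", "depth", "normal")


def add_noise(x0, eps, alpha_bar):
    """Standard diffusion forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def joint_denoising_loss(denoiser, latents, alpha_bar, rng):
    """Sum of per-modality noise-prediction losses from ONE denoiser call.

    `denoiser` is a hypothetical callable standing in for the unified network:
    it receives all noisy modalities together and returns a noise estimate
    for each. Sharing `alpha_bar` across modalities is an assumption here.
    """
    eps = {m: [rng.gauss(0.0, 1.0) for _ in latents[m]] for m in MODALITIES}
    noisy = {m: add_noise(latents[m], eps[m], alpha_bar) for m in MODALITIES}
    pred = denoiser(noisy, alpha_bar)  # one joint forward pass
    return sum(mse(pred[m], eps[m]) for m in MODALITIES)


if __name__ == "__main__":
    rng = random.Random(0)
    latents = {m: [rng.gauss(0.0, 1.0) for _ in range(16)] for m in MODALITIES}

    # A placeholder "network" that always predicts zero noise.
    def zero_denoiser(noisy, alpha_bar):
        return {m: [0.0] * len(noisy[m]) for m in MODALITIES}

    print(joint_denoising_loss(zero_denoiser, latents, 0.5, rng))
```

Because the loss sums over all three targets, a gradient step on it would update any shared parameters using appearance and geometry errors together, which is the coupling the abstract describes.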