HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
October 12, 2023
Authors: Xian Liu, Jian Ren, Aliaksandr Siarohin, Ivan Skorokhodov, Yanyu Li, Dahua Lin, Xihui Liu, Ziwei Liu, Sergey Tulyakov
cs.AI
Abstract
Despite significant advances in large-scale text-to-image models, achieving
hyper-realistic human image generation remains a desirable yet unsolved task.
Existing models like Stable Diffusion and DALL-E 2 tend to generate human
images with incoherent parts or unnatural poses. To tackle these challenges,
our key insight is that human images are inherently structural over multiple
granularities, from the coarse-level body skeleton to fine-grained spatial
geometry. Therefore, capturing such correlations between the explicit
appearance and latent structure in one model is essential to generate coherent
and natural human images. To this end, we propose a unified framework,
HyperHuman, that generates in-the-wild human images of high realism and diverse
layouts. Specifically, 1) we first build a large-scale human-centric dataset,
named HumanVerse, which consists of 340M images with comprehensive annotations
like human pose, depth, and surface normal. 2) Next, we propose a Latent
Structural Diffusion Model that simultaneously denoises the depth and surface
normal along with the synthesized RGB image. Our model enforces the joint
learning of image appearance, spatial relationship, and geometry in a unified
network, where each branch in the model complements the others with both
structural awareness and textural richness. 3) Finally, to further boost the
visual quality, we propose a Structure-Guided Refiner to compose the predicted
conditions for more detailed generation at higher resolution. Extensive
experiments demonstrate that our framework yields state-of-the-art
performance, generating hyper-realistic human images under diverse scenarios.
Project Page: https://snap-research.github.io/HyperHuman/
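
The joint denoising idea in step 2 can be illustrated with a toy sketch. It assumes a standard DDPM-style noise-prediction loss and one noise level shared across RGB, depth, and surface normal; the abstract specifies neither, so this is one plausible reading rather than the authors' implementation. The names `joint_denoising_loss`, `predict_noise`, and `zero_predictor` are hypothetical.

```python
import math
import random

def add_noise(x0, eps, alpha_bar):
    """DDPM-style forward step: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def joint_denoising_loss(sample, predict_noise, alpha_bar, rng):
    """Sum of per-modality noise-prediction losses (assumed objective).

    sample: dict mapping modality name to a flat list of floats.
    predict_noise: hypothetical stand-in for a unified network; it sees
    all noised modalities in one call and returns a noise estimate for each.
    """
    # Independent Gaussian noise per modality, one shared noise level.
    eps = {k: [rng.gauss(0.0, 1.0) for _ in v] for k, v in sample.items()}
    noisy = {k: add_noise(v, eps[k], alpha_bar) for k, v in sample.items()}
    pred = predict_noise(noisy, alpha_bar)
    return sum(mse(pred[k], eps[k]) for k in sample)

def zero_predictor(noisy, alpha_bar):
    """Placeholder model that always predicts zero noise."""
    return {k: [0.0] * len(v) for k, v in noisy.items()}

# Toy values standing in for flattened RGB, depth, and normal maps.
sample = {"rgb": [0.2, 0.5, 0.9], "depth": [0.4], "normal": [0.0, 0.0, 1.0]}
loss = joint_denoising_loss(sample, zero_predictor, 0.5, random.Random(0))
```

Because a single call to `predict_noise` receives every noised modality, errors in one branch can in principle inform the others, which is the intuition behind learning appearance and geometry in one network.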