StableIdentity: Inserting Anybody into Anywhere at First Sight
January 29, 2024
Authors: Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu
cs.AI
Abstract
Recent advances in large pretrained text-to-image models have shown
unprecedented capabilities for high-quality human-centric generation. However,
customizing face identity remains an intractable problem. Existing methods
cannot ensure stable identity preservation and flexible editability, even with
several images for each subject during training. In this work, we propose
StableIdentity, which allows identity-consistent recontextualization with just
one face image. More specifically, we employ a face encoder with an identity
prior to encode the input face, and then project the face representation into a
space with an editability prior, which is constructed from celebrity names. By
incorporating the identity prior and the editability prior, the learned identity
can be injected into a wide variety of contexts. In addition, we design a masked
two-phase diffusion loss to boost the pixel-level perception of the input face
and maintain the diversity of generation. Extensive experiments demonstrate that
our method outperforms previous customization methods. Moreover, the learned
identity can be flexibly combined with off-the-shelf modules such as
ControlNet. Notably, to the best of our knowledge, we are the first to directly
inject the identity learned from a single image into video/3D generation without
finetuning. We believe that the proposed StableIdentity is an important step
toward unifying image, video, and 3D customized generation models.
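
The abstract names a masked two-phase diffusion loss but does not define it. The following is a minimal sketch of one way such a loss could look, not the authors' implementation. It assumes that the two phases are split by diffusion timestep, that the noisier phase penalizes the noise-prediction error while the less noisy phase penalizes the error of the reconstructed clean latent, and that both are restricted to a face mask. The function name, its arguments, and the threshold `alpha` are all hypothetical.

```python
import numpy as np


def masked_two_phase_loss(noise, noise_pred, z0, z0_pred, face_mask,
                          t, num_steps, alpha=0.6):
    """Hypothetical masked two-phase loss (an assumption, not the paper's code).

    noise, noise_pred : true and predicted noise, same shape
    z0, z0_pred       : clean latent and its reconstruction, same shape
    face_mask         : array broadcastable to that shape, 1 inside the face
    t, num_steps      : current timestep and total number of timesteps
    alpha             : assumed fraction of num_steps separating the phases
    """
    mask = np.broadcast_to(face_mask, noise.shape)
    if t > alpha * num_steps:
        # Noisier phase (assumed): compare predicted noise with true noise.
        err = noise - noise_pred
    else:
        # Less noisy phase (assumed): compare reconstructed and clean latents.
        err = z0 - z0_pred
    # Squared error averaged over masked entries only.
    return float(np.sum((err * mask) ** 2) / max(mask.sum(), 1))
```

Because the error is multiplied by the mask before it is squared, pixels outside the face region contribute nothing to the loss, which is one plausible reading of "boost the pixel-level perception of the input face" while leaving the background unconstrained.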