StableIdentity: Inserting Anybody into Anywhere at First Sight

Abstract

Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure both stable identity preservation and flexible editability, even when several images of each subject are available during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization from just one face image. More specifically, we use a face encoder that carries an identity prior to encode the input face, and then project the face representation into a space with an editability prior, which is constructed from celebrity names. By combining the identity prior and the editability prior, the learned identity can be injected anywhere, in a variety of contexts. In addition, we design a masked two-phase diffusion loss to boost pixel-level perception of the input face and to maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. The learned identity can also be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject an identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.

