StableIdentity: Inserting Anybody into Anywhere at First Sight
January 29, 2024
Authors: Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu
cs.AI
Abstract
Recent advances in large pretrained text-to-image models have shown
unprecedented capabilities for high-quality human-centric generation;
however, customizing face identity remains an intractable problem. Existing
methods cannot ensure stable identity preservation and flexible editability,
even when several images per subject are available during training. In this
work, we propose StableIdentity, which enables identity-consistent
recontextualization with just one face image. More specifically, we employ a
face encoder with an identity prior to encode the input face, and then
project the face representation into a space with an editable prior, which
is constructed from celebrity names. By incorporating the identity prior and
the editability prior, the learned identity can be injected anywhere in
various contexts. In addition, we design a masked two-phase diffusion loss
to boost pixel-level perception of the input face while maintaining the
diversity of generation. Extensive experiments demonstrate that our method
outperforms previous customization methods. Moreover, the learned identity
can be flexibly combined with off-the-shelf modules such as ControlNet.
Notably, to the best of our knowledge, we are the first to directly inject
an identity learned from a single image into video/3D generation without
finetuning. We believe the proposed StableIdentity is an important step
toward unifying image, video, and 3D customized generation models.
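
The abstract's step of projecting the encoded face into an editable prior
space built from celebrity names can be pictured as a distribution-alignment
operation. Below is a minimal, hypothetical sketch in PyTorch: the AdaIN-style
whiten-and-recolor mapping, the tensor shapes, and the name
`land_into_editable_prior` are our illustration under stated assumptions,
not the paper's released code.

```python
import torch

def land_into_editable_prior(face_emb: torch.Tensor,
                             celeb_embs: torch.Tensor) -> torch.Tensor:
    """Align a face embedding to the statistics of celeb-name token
    embeddings (an assumed AdaIN-style whiten-and-recolor step).

    face_emb:   (d,)   encoded input face
    celeb_embs: (N, d) token embeddings of N celebrity names
    """
    # Per-dimension statistics of the editable prior space built
    # from celeb names.
    mu_c = celeb_embs.mean(dim=0)
    sigma_c = celeb_embs.std(dim=0)
    # Whiten the face embedding, then re-color it with the celeb
    # statistics, so the learned identity lands in a region where
    # text prompts remain editable.
    mu_f, sigma_f = face_emb.mean(), face_emb.std()
    return sigma_c * (face_emb - mu_f) / (sigma_f + 1e-8) + mu_c
```

Placing the identity inside the statistics of known, highly editable name
embeddings is what lets a single learned token respond to recontextualizing
prompts the way a celebrity name would.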
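
The masked two-phase diffusion loss can likewise be sketched as a
timestep-dependent switch between a global noise-prediction term and a
face-masked reconstruction term. The threshold `t_switch`, the weighting,
and the exact masked term below are assumptions for illustration, not the
paper's specification.

```python
import torch
import torch.nn.functional as F

def masked_two_phase_loss(noise_pred: torch.Tensor,
                          noise: torch.Tensor,
                          x0_pred: torch.Tensor,
                          x0: torch.Tensor,
                          face_mask: torch.Tensor,
                          t: int,
                          t_switch: int = 500) -> torch.Tensor:
    # Phase 1 (high noise, t > t_switch): the standard noise-prediction
    # objective over the whole image maintains generation diversity.
    if t > t_switch:
        return F.mse_loss(noise_pred, noise)
    # Phase 2 (low noise): weight the reconstruction by a face mask to
    # boost pixel-level perception of the identity region.
    return F.mse_loss(face_mask * x0_pred, face_mask * x0)
```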