StableIdentity: Inserting Anybody into Anywhere at First Sight

Abstract

Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation, but customizing face identity remains an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even when several images of each subject are used during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. Specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editability prior constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. The learned identity can also be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.
