MagiCapture: High-Resolution Multi-Concept Portrait Customization

Abstract

Large-scale text-to-image models, including Stable Diffusion, can generate high-fidelity, photorealistic portrait images. An active line of research aims to personalize these models so that they synthesize specific subjects or styles from a provided set of reference images. Although these personalization methods yield plausible results, their outputs often fall short of realism and are not yet commercially viable. The gap is particularly noticeable in portrait generation, where any unnatural artifact in a human face is easily discernible because of our inherent human bias. To address this, we introduce MagiCapture, a personalization method that integrates subject and style concepts to generate high-resolution portrait images from just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge of this task is the absence of ground truth for the composed concepts, which leads to lower-quality final outputs and an identity shift of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning in this weakly supervised setting. Our pipeline also includes additional post-processing steps to ensure highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to non-human objects.
