MagiCapture: High-Resolution Multi-Concept Portrait Customization

Abstract

Large-scale text-to-image models, including Stable Diffusion, can generate high-fidelity, photorealistic portrait images. An active line of research aims to personalize these models so that they synthesize specific subjects or styles from a provided set of reference images. However, although the results of these personalization methods look plausible, they often fall short of realism and are not yet commercially viable. This is particularly noticeable in portrait generation, where any unnatural artifact in a human face is easily discerned because of our inherent human bias. To address this, we introduce MagiCapture, a personalization method that integrates subject and style concepts to generate high-resolution portrait images from just a few subject and style references. For instance, given a handful of random selfies, our fine-tuned model can generate high-quality portrait images in specific styles, such as passport or profile photos. The main challenge of this task is the absence of ground truth for the composed concepts, which reduces the quality of the final output and shifts the identity of the source subject. To address these issues, we present a novel Attention Refocusing loss coupled with auxiliary priors, both of which facilitate robust learning in this weakly supervised setting. Our pipeline also includes additional post-processing steps to ensure highly realistic outputs. MagiCapture outperforms other baselines in both quantitative and qualitative evaluations and can also be generalized to non-human objects.

