TryOnDiffusion: A Tale of Two UNets

Abstract

Given two images, one depicting a person and the other a garment worn by a different person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic, detail-preserving visualization of the garment while warping it to accommodate a significant change in body pose and shape between the two subjects. Previous methods either focus on preserving garment detail without handling pose and shape variation effectively, or allow try-on with the desired shape and pose but lose garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet are: 1) the garment is warped implicitly via a cross-attention mechanism, and 2) garment warping and person blending happen as part of a unified process rather than as a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively.
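
To make the first key idea more concrete, the following is a minimal NumPy sketch of single-head cross attention in which person features act as queries and garment features act as keys and values, so each person location gathers a weighted mix of garment features. This is an illustration only, not the Parallel-UNet implementation: the assignment of queries to the person and keys/values to the garment, the function name cross_attention, the random projection matrices, and all feature sizes are assumptions that the abstract does not specify.

```python
import numpy as np


def cross_attention(person_feats, garment_feats, w_q, w_k, w_v):
    """Single-head cross attention (illustrative sketch, not the paper's code).

    person_feats:  (Np, C) flattened person feature map, used as queries.
    garment_feats: (Ng, C) flattened garment feature map, used as keys/values.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical, random here).
    """
    q = person_feats @ w_q                    # (Np, D)
    k = garment_feats @ w_k                   # (Ng, D)
    v = garment_feats @ w_v                   # (Ng, D)

    # Scaled dot-product similarity between every person and garment location.
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (Np, Ng)

    # Numerically stable softmax over garment locations.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each person location receives a convex combination of garment features,
    # which plays the role of an implicit (learned) warp.
    return weights @ v, weights


# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
channels, dim = 8, 16
person_feats = rng.normal(size=(6, channels))
garment_feats = rng.normal(size=(10, channels))
w_q, w_k, w_v = (rng.normal(size=(channels, dim)) for _ in range(3))

warped, weights = cross_attention(person_feats, garment_feats, w_q, w_k, w_v)
```

Because the attention weights are computed from the features themselves, no explicit flow field or warping grid is needed in this sketch; the correspondence between garment and person locations is whatever the (here random, in practice learned) projections induce.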
