Dress&Dance: Dress up and Dance as You Like It - Technical Preview

Abstract

We present Dress&Dance, a video diffusion framework that generates high-quality, 5-second, 24 FPS virtual try-on videos at 1152x720 resolution, showing a user wearing the desired garments while moving in accordance with a given reference video. Our approach requires a single user image and supports a range of tops, bottoms, and one-piece garments, as well as simultaneous try-on of tops and bottoms in a single pass. Key to our framework is CondNet, a novel conditioning network that leverages attention to unify multi-modal inputs (text, images, and videos), thereby enhancing garment registration and motion fidelity. CondNet is trained in a multistage, progressive manner on heterogeneous training data that combines limited video data with a larger, more readily available image dataset. Dress&Dance outperforms existing open-source and commercial solutions and enables a high-quality, flexible try-on experience.
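The abstract says only that CondNet uses attention to unify text, image, and video inputs; it does not describe the network's internals. As a rough illustration of that general idea, and not of CondNet itself, the sketch below concatenates tokens from several modalities into one context sequence and lets video latent tokens attend to it with plain single-head scaled dot-product attention. Every name, token count, and dimension here is a hypothetical choice, and random vectors stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections.

    queries: list of d-dimensional vectors (here, toy video latent tokens).
    context: list of d-dimensional vectors (here, the condition tokens).
    Returns one d-dimensional output per query, plus the attention weights.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * k[i] for w, k in zip(weights, context)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


def random_tokens(count, dim, rng):
    """Toy stand-in for an encoder: returns `count` random d-dimensional tokens."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8  # hypothetical token width, chosen only for this example

# Hypothetical token counts per modality (assumptions, not from the paper).
text_tokens = random_tokens(4, DIM, rng)      # text input
garment_tokens = random_tokens(6, DIM, rng)   # garment image
user_tokens = random_tokens(6, DIM, rng)      # single user image
motion_tokens = random_tokens(10, DIM, rng)   # reference video

# Unify all modalities into one context sequence.
condition_tokens = text_tokens + garment_tokens + user_tokens + motion_tokens

latent_tokens = random_tokens(5, DIM, rng)    # toy video latents used as queries
fused, weights = cross_attention(latent_tokens, condition_tokens)
```

Because all condition tokens share one sequence, each latent token can weigh garment, user, motion, and text information against each other in a single attention step. A real model would add learned projections, multiple heads, and positional information on top of this.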
