Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often leads to identity drift, garment distortion, and front-back inconsistency. Vanast addresses these issues by performing the entire process in a single unified step, which yields coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline (1) generates identity-preserving human images in alternative outfits that differ from the garment catalog images, (2) captures full upper- and lower-garment triplets to overcome the limitation of data that pairs only a single garment with a posed video, and (3) assembles diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers that stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation, while also supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
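The abstract does not say how zero-shot garment interpolation is carried out. As a rough illustration only, one common way to interpolate between two conditioning inputs is to blend their embeddings linearly before passing the result to the generator. The sketch below assumes each garment has already been encoded into a fixed-length vector; the function name `blend_garment_embeddings` and the use of plain Python lists are hypothetical and are not taken from the paper.

```python
from typing import List


def blend_garment_embeddings(
    emb_a: List[float], emb_b: List[float], alpha: float
) -> List[float]:
    """Linearly blend two garment embeddings (hypothetical helper).

    alpha = 0.0 returns emb_a, alpha = 1.0 returns emb_b.
    This is an assumed mechanism, not the method described in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    # Element-wise convex combination of the two vectors.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


# Toy two-dimensional embeddings, halfway between garment A and garment B.
halfway = blend_garment_embeddings([0.0, 2.0], [4.0, 6.0], alpha=0.5)
```

Sweeping `alpha` from 0 to 1 would give a sequence of intermediate conditions. Whether Vanast interpolates in an embedding space of this kind is not stated in the abstract.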
