Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Abstract

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
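The abstract names attention temperature scaling but does not give its formula or say how the temperature is chosen. As a minimal sketch of the general idea only, assuming the common form in which attention logits are divided by a temperature before the softmax, the following shows how a lower temperature sharpens the attention distribution. The function names and the `temperature` parameter are illustrative assumptions, not Voost's actual implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, temperature=1.0):
    # Scaled dot-product attention weights for a single query.
    # ASSUMPTION: logits are divided by `temperature`; how Voost sets this
    # value (e.g. from resolution or mask size) is not stated in the abstract.
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax([logit / temperature for logit in logits])

def entropy(ps):
    # Shannon entropy; lower values mean a more peaked distribution.
    return -sum(p * math.log(p) for p in ps if p > 0)

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

baseline = attention_weights(query, keys, temperature=1.0)
sharper = attention_weights(query, keys, temperature=0.5)
```

With `temperature=1.0` this reduces to standard scaled dot-product attention; values below 1 concentrate weight on the best-matching key, and values above 1 flatten the distribution.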
