iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace the garment worn by a person in a video with a new one. Existing methods have made significant progress in maintaining temporal consistency, but most are confined to non-interactive scenarios in which models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new, challenging task, Interactive Video Virtual Try-On (Interactive VVT), in which subjects in the video actively engage with their clothing. The task poses challenges beyond texture preservation: (1) standard pose information leaves the semantics of an interaction ambiguous, and (2) complex garment deformations must be learned from videos in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn introduces a multi-level interaction injection mechanism that guides the generation of complex dynamics. At the spatial level, a garment-agnostic 3D hand prior provides fine-grained guidance for precise hand-garment contact, resolving spatial ambiguity. At the semantic level, iTryOn uses global captions for overall context and time-stamped action captions for localized interactions, and synchronizes the two with the video through our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
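The abstract does not say how A-RoPE is computed. The sketch below is only an illustration of the general idea of aligning time-stamped caption tokens with video frames through rotary positions. It uses the standard rotary position embedding and assumes, purely for illustration, that every token of an action caption is assigned the frame index at the midpoint of its time span. The function names `rope_rotate` and `action_token_positions`, the frame rate, and the token counts are hypothetical and are not taken from the paper.

```python
import numpy as np


def rope_rotate(x, positions, base=10000.0):
    """Apply a standard rotary position embedding (rotate-half layout).

    x: array of shape (n, d) with d even; positions: n scalar positions.
    """
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per 2D pair
    angles = positions[:, None] * freqs[None, :]   # (n, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each pair (x1[i], x2[i]) by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def action_token_positions(start_s, end_s, fps, n_tokens):
    """Hypothetical rule (an assumption, not the paper's): place all tokens of
    an action caption at the frame index in the middle of its time span."""
    mid_frame = 0.5 * (start_s + end_s) * fps
    return np.full(n_tokens, mid_frame)


# Toy setup: 16 video-frame queries and one 3-token action caption
# time-stamped 1.0 s to 2.0 s at an assumed 8 frames per second.
frames = np.arange(16)
q = rope_rotate(np.ones((16, 8)), frames)
k = rope_rotate(np.ones((3, 8)),
                action_token_positions(1.0, 2.0, fps=8, n_tokens=3))
scores = q @ k.T  # (16, 3): frame-to-caption attention logits
```

With identical content vectors, the logits in `scores` depend only on the offset between a frame's index and the caption's assigned position, so they are largest for the frame nearest the caption's time stamp. That relative-offset property is what makes a shared rotary position a plausible way to tie a local caption to the frames it describes. The actual A-RoPE design may differ.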