iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

May 20, 2026
Authors: Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang
cs.AI

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment worn by a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are largely confined to non-interactive scenarios in which models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a challenging new task, Interactive Video Virtual Try-On (Interactive VVT), in which subjects in the video actively engage with their clothing. This task poses challenges beyond simple texture preservation, including (1) resolving the semantic ambiguity of interactions that standard pose information leaves open, and (2) learning complex garment deformations from videos in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn uses global captions for overall context and time-stamped action captions for localized interactions, with the two synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
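
The abstract does not spell out how A-RoPE works, so the following is only a minimal sketch of one plausible reading, not the authors' method: tokens of a time-stamped action caption receive rotary positions taken from the frames their action spans, so that attention scores against video tokens depend on temporal offset. The rotation itself is standard rotary position embedding; the helper `action_token_positions`, its parameters, and the even spreading of tokens over the span are all assumptions made for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply standard rotary position embedding to one even-length vector."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates the vector in 2-D pairs"
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 uses frequency base^(-i/dim), as in standard RoPE.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def action_token_positions(num_tokens, start_sec, end_sec, fps):
    """Hypothetical mapping (an assumption, not from the paper):
    spread the caption tokens evenly over the frames of their action span."""
    start_f = start_sec * fps
    end_f = end_sec * fps
    if num_tokens == 1:
        return [(start_f + end_f) / 2]
    step = (end_f - start_f) / (num_tokens - 1)
    return [start_f + k * step for k in range(num_tokens)]


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


# An action caption with 3 tokens covering seconds 1.0 to 2.0 of an 8 fps clip.
positions = action_token_positions(3, 1.0, 2.0, 8)

# A video-frame query at frame 12 scored against the caption token placed at frame 8.
query = rope_rotate([0.3, -1.2, 0.7, 0.5], 12)
key = rope_rotate([1.0, 0.2, -0.4, 0.9], positions[0])
score = dot(query, key)
```

Because rotary embeddings make the query-key dot product depend only on the difference between the two positions, a caption token placed on the frame axis in this way interacts with video tokens according to how far they are from the action in time. Whether iTryOn assigns positions per token, per caption, or along additional axes is not stated in the abstract.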