iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

May 20, 2026
Authors: Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang
cs.AI

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.