

DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

August 4, 2025
作者: Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, Xin Dong
cs.AI

Abstract

Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage the priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further exploit prior knowledge from pretrained models and test-time inputs, in the first stage we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these, along with the keyframe try-on images, are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page is available at https://virtu-lab.github.io/.
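To make the two-stage data flow described in the abstract easier to follow, the Python sketch below outlines how the pieces fit together. It is based only on the abstract: every class and function name (`Frame`, `TryOnModel`, `VideoDiT`, `VLM`, `sample_keyframes`, `dream_vvt`) is a hypothetical placeholder, not the authors' released API, and the uniform-stride keyframe sampling stands in for whatever sampling strategy the paper actually uses.

```python
"""Illustrative sketch of the two-stage DreamVVT pipeline, reconstructed
from the abstract. All names and interfaces are hypothetical placeholders."""

from dataclasses import dataclass
from typing import List, Protocol, Sequence


@dataclass
class Frame:
    """A single RGB video frame (placeholder representation)."""
    pixels: bytes


class TryOnModel(Protocol):
    """Multi-frame try-on model used in stage 1 (assumed interface)."""
    def generate(self, frames: Sequence[Frame], garment: Frame,
                 text_guidance: str) -> List[Frame]: ...


class VideoDiT(Protocol):
    """Pretrained video generation model with LoRA adapters (assumed interface)."""
    def generate(self, skeletons: Sequence[Frame],
                 keyframes: Sequence[Frame], prompt: str) -> List[Frame]: ...


class VLM(Protocol):
    """Vision-language model providing semantic descriptions (assumed interface)."""
    def describe(self, image: Frame) -> str: ...
    def describe_video(self, frames: Sequence[Frame]) -> str: ...


def sample_keyframes(frames: Sequence[Frame], stride: int = 16) -> List[Frame]:
    """Pick representative frames; a uniform stride stands in for the
    paper's (unspecified) keyframe-sampling strategy."""
    return list(frames[::stride])


def dream_vvt(frames: Sequence[Frame], garment: Frame, skeletons: Sequence[Frame],
              tryon_model: TryOnModel, vlm: VLM, video_model: VideoDiT) -> List[Frame]:
    # Stage 1: run multi-frame try-on on sampled keyframes, guided by a VLM
    # caption of the garment, to obtain semantically consistent keyframe
    # try-on images that later serve as appearance guidance.
    keyframes = sample_keyframes(frames)
    garment_caption = vlm.describe(garment)
    keyframe_tryons = tryon_model.generate(keyframes, garment, garment_caption)

    # Stage 2: condition the pretrained video DiT (with LoRA adapters) on the
    # skeleton maps, a fine-grained motion/appearance description of the input
    # video, and the stage-1 keyframe try-on images to render the full video.
    motion_and_appearance = vlm.describe_video(frames)
    return video_model.generate(skeletons, keyframe_tryons, motion_and_appearance)
```

The split mirrors the paper's motivation: the image-level stage can draw on abundant unpaired human-centric data and strong image priors, while the video stage only has to propagate the already-resolved garment appearance coherently over time.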