

DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

August 4, 2025
作者: Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, Xin Dong
cs.AI

Abstract

Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these, along with the keyframe try-on images, are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page is available at https://virtu-lab.github.io/.
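To make the data flow of the two-stage pipeline described above concrete, the sketch below lays it out in Python. It is only a structural illustration under stated assumptions: all function names, parameters, and the stand-in callables (for the VLM, the multi-frame try-on model, the pose estimator, and the LoRA-adapted video generator) are hypothetical and are not DreamVVT's released code or API.

```python
# Structural sketch of the two-stage DreamVVT pipeline described in the abstract.
# All names and signatures here are illustrative assumptions, not the authors' API.

from typing import Callable, List, Sequence


def sample_keyframes(frames: Sequence, num_keyframes: int = 4) -> List:
    """Pick representative frames (uniform sampling here as a simple stand-in)."""
    if not frames:
        return []
    step = max(1, len(frames) // num_keyframes)
    return list(frames[::step])[:num_keyframes]


def dreamvvt_pipeline(
    video_frames: Sequence,
    garment_image,
    vlm: Callable,                 # stand-in: produces text descriptions from images
    multi_frame_tryon: Callable,   # stand-in: DiT-based multi-frame try-on model (stage 1)
    pose_estimator: Callable,      # stand-in: per-frame skeleton-map extractor
    video_generator: Callable,     # stand-in: pretrained video DiT with LoRA adapters (stage 2)
):
    # --- Stage 1: keyframe try-on ---
    keyframes = sample_keyframes(video_frames)
    garment_description = vlm(keyframes, garment_image)            # semantic guidance from the VLM
    tryon_keyframes = multi_frame_tryon(keyframes, garment_image, garment_description)

    # --- Stage 2: guided video generation ---
    skeleton_maps = [pose_estimator(f) for f in video_frames]      # pose guidance for every frame
    motion_description = vlm(video_frames, garment_image)          # fine-grained motion/appearance text
    return video_generator(
        skeleton_maps=skeleton_maps,
        text_condition=motion_description,
        appearance_keyframes=tryon_keyframes,                      # appearance guidance for unseen regions
    )
```

In this reading, the keyframe try-on images carry garment detail into the second stage, while the skeleton maps and text descriptions constrain motion, which matches the abstract's claim of combining appearance guidance with long-term temporal coherence.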