DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

Abstract

Video virtual try-on (VVT) technology has attracted considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors from advanced visual models and test-time inputs, which makes it difficult to preserve fine-grained garment details accurately and to maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage we sample representative frames from the input video and use a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these, along with the keyframe try-on images, are fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. The project page is available at https://virtu-lab.github.io/.
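The two-stage data flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the evenly spaced keyframe sampling, the names `sample_keyframe_indices` and `run_two_stage`, and the stub models that stand in for the multi-frame try-on model and the LoRA-enhanced video generator. Frames are represented as strings, and the skeleton maps and text descriptions used in the second stage are omitted for brevity. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = str  # placeholder type; a real system would use image tensors


def sample_keyframe_indices(num_frames: int, num_keyframes: int) -> List[int]:
    """Pick evenly spaced frame indices (assumed strategy, not from the paper)."""
    if num_frames <= 0 or num_keyframes <= 0:
        return []
    if num_keyframes >= num_frames:
        return list(range(num_frames))
    if num_keyframes == 1:
        return [num_frames // 2]
    # Integer arithmetic keeps the first and last frame and avoids rounding surprises.
    return [(i * (num_frames - 1)) // (num_keyframes - 1) for i in range(num_keyframes)]


@dataclass
class TryOnResult:
    keyframe_indices: List[int]
    keyframe_tryons: List[Frame]
    video: List[Frame]


def run_two_stage(
    video: Sequence[Frame],
    garment: Frame,
    keyframe_model: Callable[[List[Frame], Frame], List[Frame]],
    video_model: Callable[[Sequence[Frame], List[Frame], List[int]], List[Frame]],
    num_keyframes: int = 3,
) -> TryOnResult:
    """Stage 1: keyframe try-on images. Stage 2: video generation guided by them."""
    indices = sample_keyframe_indices(len(video), num_keyframes)
    keyframes = [video[i] for i in indices]
    tryons = keyframe_model(keyframes, garment)   # stage 1 (hypothetical callable)
    output = video_model(video, tryons, indices)  # stage 2 (hypothetical callable)
    return TryOnResult(indices, tryons, output)


# Stub models used only to exercise the data flow.
def fake_keyframe_model(frames: List[Frame], garment: Frame) -> List[Frame]:
    return [f"{frame}+{garment}" for frame in frames]


def fake_video_model(
    video: Sequence[Frame], tryons: List[Frame], indices: List[int]
) -> List[Frame]:
    guide = dict(zip(indices, tryons))
    return [guide.get(i, f"{frame}~gen") for i, frame in enumerate(video)]


if __name__ == "__main__":
    clip = [f"f{i}" for i in range(5)]
    result = run_two_stage(
        clip, "shirt", fake_keyframe_model, fake_video_model, num_keyframes=2
    )
    print(result.keyframe_indices, result.video)
```

Passing the two stages in as callables keeps the sketch independent of any particular model, so the keyframe stage and the video stage can be swapped or tested separately.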
