iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Abstract

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios in which models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a challenging new task, Interactive Video Virtual Try-On (Interactive VVT), in which subjects in the video actively engage with their clothing. This task poses challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions when only standard pose information is available, and (2) learning complex garment deformations from videos in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn introduces a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
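
To make the idea of time-stamped action captions concrete, the following minimal Python sketch shows one way such captions could be aligned to video frames. It is an illustration only and not the authors' method: the abstract does not specify how iTryOn or A-RoPE consume the captions, and the names `ActionCaption` and `captions_per_frame`, the half-open time interval, and the example frame rate are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionCaption:
    """Hypothetical container for one time-stamped action caption."""
    start_s: float  # start time in seconds (inclusive, by assumption)
    end_s: float    # end time in seconds (exclusive, by assumption)
    text: str       # short description of the interaction


def captions_per_frame(
    captions: List[ActionCaption], num_frames: int, fps: float
) -> List[Optional[str]]:
    """For each frame, return the first caption whose interval covers
    the frame's timestamp, or None if no interaction is annotated."""
    result: List[Optional[str]] = []
    for i in range(num_frames):
        t = i / fps  # timestamp of frame i in seconds
        active = None
        for c in captions:
            if c.start_s <= t < c.end_s:
                active = c.text
                break
        result.append(active)
    return result


if __name__ == "__main__":
    caps = [ActionCaption(0.5, 1.0, "pulls the hem")]
    # 8 frames at 4 fps: only the frames at 0.5 s and 0.75 s are covered.
    print(captions_per_frame(caps, num_frames=8, fps=4.0))
```

In this toy example most frames carry no action caption, which mirrors the sparsity of interactive moments noted in the abstract; a global caption would apply to every frame regardless.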