Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off

Abstract

Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost, a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
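
The abstract names attention temperature scaling without defining it. As a rough illustration of the general idea only, the sketch below divides attention logits by a temperature before the softmax, so that lower temperatures sharpen the attention weights and higher ones flatten them. This is not Voost's formulation: the function names, the `temperature` parameter, and the toy vectors are all assumptions made for illustration, and how the paper ties the temperature to resolution or mask size is not stated here.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over a list of scores, with logits divided by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values, temperature=1.0):
    """Single-query scaled dot-product attention with an extra temperature.

    Hypothetical helper: the usual 1/sqrt(dim) scaling is kept, and
    `temperature` is an additional divisor applied to the logits.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores, temperature)
    # Weighted sum of the value vectors, one output coordinate at a time.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return output, weights

# Toy inputs (assumed for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

_, sharp = attention(query, keys, values, temperature=0.5)
_, base = attention(query, keys, values, temperature=1.0)
_, flat = attention(query, keys, values, temperature=2.0)
```

In this toy setup the weight on the best-matching key is largest at `temperature=0.5` and smallest at `temperature=2.0`, which is the lever an inference-time adjustment of this kind would tune as the number of attended tokens changes.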
