

Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

April 6, 2026
作者: Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo
cs.AI

Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose-guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified step to achieve coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline includes generating identity-preserving human images in alternative outfits that differ from the garment catalog images, capturing full upper- and lower-garment triplets to overcome the limitation of single garment-pose video pairs, and assembling diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers that stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
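The abstract mentions zero-shot garment interpolation but does not describe the mechanism. A common way such interpolation is realized in conditional generative models is to linearly blend the conditioning embeddings of two garments before passing them to the generator. The sketch below illustrates that idea only; the function name, the use of plain vectors, and the blending scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def interpolate_garment_embeddings(emb_a, emb_b, alpha):
    """Linearly blend two garment conditioning embeddings.

    alpha=0.0 returns emb_a, alpha=1.0 returns emb_b; intermediate
    values yield a mixed garment condition. This is a hypothetical
    mechanism for illustration, not Vanast's documented internals.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    return (1.0 - alpha) * emb_a + alpha * emb_b


# Toy 4-dimensional vectors standing in for real garment-encoder outputs.
a = [1.0, 0.0, 2.0, 0.0]
b = [0.0, 2.0, 0.0, 4.0]
mid = interpolate_garment_embeddings(a, b, 0.5)  # halfway blend of a and b
```

In practice the blended embedding would replace the single-garment condition fed to the video diffusion transformer, producing an animation whose clothing mixes attributes of both source garments.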