FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Abstract

Human-centric video customization, particularly at the garment level, has significant commercial value. However, existing approaches do not support the low-latency, interactive garment control that applications such as e-commerce and content creation require. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence, using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, in which users can switch garments interactively during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference image and the garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which comprises garment KV refresh, historical KV withdraw, and reference KV disentangle, to achieve garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, and it achieves real-time generation at 23.8 FPS on a single GPU, 30 to 180 times faster than existing baselines.
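To make the idea of Training-Free KV Cache Rescheduling more concrete, the toy sketch below models a cache with three separate segments and a garment-switch operation. It is only an illustration under stated assumptions, not the authors' implementation: the class `ToyKVCache`, its fields, the `keep_recent` parameter, and the specific reading of each step (refresh replaces the garment entries, withdraw drops older history entries, disentangle means the reference segment is stored separately and left untouched) are all hypothetical, and real KV entries would be attention tensors rather than tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical stand-in for a cached key/value entry.
# (garment id the entry was computed under, frame index; -1 marks a conditioning image)
KV = Tuple[str, int]


def last_n(items: List[KV], n: int) -> List[KV]:
    """Return the last n items; n == 0 yields an empty list (items[-0:] would not)."""
    return items[len(items) - n:] if 0 < n <= len(items) else ([] if n <= 0 else list(items))


@dataclass
class ToyKVCache:
    reference: List[KV] = field(default_factory=list)  # reference-image entries
    garment: List[KV] = field(default_factory=list)    # current garment-image entries
    history: List[KV] = field(default_factory=list)    # entries of already generated frames
    max_history: int = 4                                # assumed sliding-window size

    def append_frame(self, frame_idx: int) -> None:
        """Cache a newly generated frame under the currently active garment."""
        garment_id = self.garment[0][0]
        self.history.append((garment_id, frame_idx))
        self.history = last_n(self.history, self.max_history)

    def switch_garment(self, new_garment: str, keep_recent: int) -> None:
        """Assumed reading of the three rescheduling steps named in the abstract."""
        # Garment KV refresh: replace the garment segment with the new garment's entries.
        self.garment = [(new_garment, -1)]
        # Historical KV withdraw: keep only the most recent frames (assumption),
        # so fewer entries computed under the old garment remain in context.
        self.history = last_n(self.history, keep_recent)
        # Reference KV disentangle: the reference segment lives in its own list
        # and is deliberately not modified by the switch.
```

In this sketch, a switch after four generated frames with `keep_recent=2` leaves the reference entries unchanged, replaces the garment entries, and retains only the two latest history entries. How many entries to withdraw, and which ones, is a design choice that the abstract does not specify.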