FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency, interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence, using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, in which users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference image and the garment image, the model is implicitly encouraged to preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which comprises garment KV refresh, historical KV withdraw, and reference KV disentangle, to achieve garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation while achieving real-time generation at 23.8 FPS on a single GPU, 30-180 times faster than existing baselines.
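
The abstract names the three rescheduling operations but does not specify how they work. As a rough intuition only, the following toy sketch shows one possible reading of rescheduling a segmented KV cache when the user switches garments. Everything in it is an assumption made for illustration: the class `ToyKVCache`, its fields, the rolling-window size, the choice to withdraw the most recent chunks, and the use of strings in place of real key/value tensors are all hypothetical and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical segmented cache; strings stand in for KV tensors."""
    reference_kv: str   # kept in its own slot, apart from garment and history
    garment_kv: str     # conditioning for the currently selected garment
    # Rolling window of recent chunks (window size 4 is an arbitrary choice).
    history_kv: deque = field(default_factory=lambda: deque(maxlen=4))

    def append_chunk(self, chunk_id: int) -> None:
        # Tag each generated chunk with the garment it was conditioned on.
        self.history_kv.append((chunk_id, self.garment_kv))

    def switch_garment(self, new_garment: str, withdraw: int = 2) -> None:
        # Assumed "garment KV refresh": overwrite the garment slot.
        self.garment_kv = new_garment
        # Assumed "historical KV withdraw": drop the newest chunks, which
        # carry the old garment most strongly; older chunks remain so that
        # some motion context survives the switch.
        for _ in range(min(withdraw, len(self.history_kv))):
            self.history_kv.pop()
        # Assumed "reference KV disentangle": the reference slot is stored
        # separately and is therefore untouched by the switch.

    def context(self) -> list:
        # Entries the next chunk would attend to, in order.
        past = [f"chunk{c}:{g}" for c, g in self.history_kv]
        return [self.reference_kv, self.garment_kv] + past


cache = ToyKVCache(reference_kv="ref", garment_kv="garment_A")
for i in range(6):
    cache.append_chunk(i)
cache.switch_garment("garment_B", withdraw=2)
cache.append_chunk(6)
print(cache.context())
```

In this toy version no retraining happens at switch time; only the cache contents change, which is the sense in which such a scheme could be called training-free. How the actual method selects which entries to refresh, withdraw, or separate is not stated in the abstract.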