FashionChameleon: Toward Real-Time Interactive Human-Garment Video Customization

Abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency, interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence, using only single-garment video data. We present FashionChameleon, a real-time, interactive framework for human-garment customization in autoregressive video generation, in which users can switch garments interactively during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment images, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which comprises garment KV refresh, historical KV withdrawal, and reference KV disentanglement, to achieve garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30 to 180 times faster than existing baselines.
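
To put the reported speed in concrete terms, the short sketch below converts the figures quoted in the abstract (23.8 FPS and a 30 to 180 times speedup) into per-frame latency and the baseline throughput they would imply. It assumes the speedup refers to frame throughput, which the abstract does not state explicitly; the function and variable names are illustrative only and do not come from the paper.

```python
# Figures quoted in the abstract.
REPORTED_FPS = 23.8
SPEEDUP_RANGE = (30, 180)  # "30 to 180 times faster than existing baselines"


def frame_latency_ms(fps: float) -> float:
    """Average time per generated frame, in milliseconds."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Baseline throughput implied by a speedup factor.

    Assumption: the speedup is measured in frames per second.
    """
    return fps / speedup


latency_ms = frame_latency_ms(REPORTED_FPS)
baseline_fps = [implied_baseline_fps(REPORTED_FPS, s) for s in SPEEDUP_RANGE]
baseline_latency_s = [1.0 / f for f in baseline_fps]

print(f"FashionChameleon: {latency_ms:.1f} ms per frame")
for speedup, fps, latency in zip(SPEEDUP_RANGE, baseline_fps, baseline_latency_s):
    print(f"Baseline at {speedup}x slower: {fps:.3f} FPS, {latency:.2f} s per frame")
```

Under that assumption, the reported rate corresponds to roughly 42 ms per frame, while the implied baselines would need on the order of one to several seconds per frame, which is the gap between interactive and offline generation.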