FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Abstract

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency, interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence, using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, in which users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference image and the garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model to interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which comprises garment KV refresh, historical KV withdraw, and reference KV disentangle, and which achieves garment switching while preserving motion coherence. FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180× faster than existing baselines.
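To put the reported speed in concrete terms, the short sketch below converts the two figures quoted in the abstract (23.8 FPS and a 30-180× speedup) into per-frame latency and the baseline throughput they imply. It assumes the speedup is a plain ratio of frames per second measured under comparable settings, which the abstract does not state explicitly; the function names are illustrative only and do not come from the paper.

```python
# Figures quoted in the abstract.
REPORTED_FPS = 23.8
SPEEDUP_LOW, SPEEDUP_HIGH = 30, 180


def frame_latency_ms(fps: float) -> float:
    """Average time per generated frame, in milliseconds."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Baseline throughput implied by a speedup factor.

    Assumption: the speedup is a ratio of FPS values, i.e.
    speedup = fps / baseline_fps.
    """
    return fps / speedup


if __name__ == "__main__":
    print(f"Per-frame latency: {frame_latency_ms(REPORTED_FPS):.1f} ms")
    for speedup in (SPEEDUP_LOW, SPEEDUP_HIGH):
        baseline = implied_baseline_fps(REPORTED_FPS, speedup)
        print(
            f"{speedup}x speedup -> baseline at about {baseline:.2f} FPS "
            f"({1.0 / baseline:.2f} s per frame)"
        )
```

Under that assumption, the baselines would run at roughly 0.13 to 0.79 FPS, or between about 1.3 and 7.6 seconds per frame, compared with about 42 ms per frame for the reported 23.8 FPS.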