Subject-driven Video Generation via Disentangled Identity and Motion

Abstract

We propose training a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot setting that requires no additional tuning. Existing tuning-free video customization methods often rely on large annotated video datasets, which are computationally expensive to train on and require extensive annotation. In contrast, we train the video customization model directly on an image customization dataset, factorizing video customization into two parts: (1) identity injection through the image customization dataset and (2) preservation of temporal modeling using a small set of unannotated videos through image-to-video training. During image-to-video fine-tuning, we additionally apply random image token dropping with randomized image initialization to mitigate the copy-and-paste issue. To further strengthen learning, we introduce stochastic switching during the joint optimization of subject-specific and temporal features, which mitigates catastrophic forgetting. Our method achieves strong subject consistency and scalability and outperforms existing video customization models in zero-shot settings, demonstrating the effectiveness of the framework.
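As an illustration only, the stochastic switching and random image token dropping mentioned in the abstract might be sketched as follows. The abstract gives no probabilities, function names, or model interface, so `p_identity`, `drop_prob`, `choose_task`, `drop_image_tokens`, and `training_schedule` are all hypothetical, and the sketch covers only the per-step scheduling logic, not the actual training code.

```python
import random


def choose_task(rng, p_identity=0.5):
    """Pick the objective for one training step (hypothetical helper).

    With probability p_identity the step uses the image customization
    data (identity injection); otherwise it uses the unannotated videos
    (image-to-video training). The value 0.5 is an assumption.
    """
    return "identity" if rng.random() < p_identity else "temporal"


def drop_image_tokens(tokens, rng, drop_prob=0.3):
    """Drop each conditioning image token independently (assumed scheme).

    drop_prob is an assumed value; the abstract does not specify one.
    """
    return [t for t in tokens if rng.random() >= drop_prob]


def training_schedule(num_steps, seed=0, p_identity=0.5):
    """Return the sequence of objectives chosen over num_steps steps."""
    rng = random.Random(seed)
    return [choose_task(rng, p_identity) for _ in range(num_steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(training_schedule(10))
    print(drop_image_tokens(list(range(16)), rng))
```

In a real training loop, the chosen task would decide which batch and loss are used for that step, and the token dropping would be applied to the image conditioning only on image-to-video steps.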

