Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Abstract

Generating high-fidelity human videos that match a user-specified identity is an important yet challenging problem in generative AI. Existing methods often rely on an excessive number of trainable parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight, plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into a pre-trained video generation model. Identity control is achieved through restricted self-attention with conditional position mapping, and it can be learned quickly from only 2,000 pairs. Despite adding and training only about 1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, it can be seamlessly integrated into other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.
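The abstract names restricted self-attention over a conditional image branch but does not spell out the restriction. The following is a minimal NumPy sketch under one assumed reading: tokens from the reference image attend only to each other, while video tokens attend to both video and image tokens. The function name, argument names, and shapes are hypothetical, the conditional position mapping is omitted entirely, and this is not the authors' implementation.

```python
import numpy as np


def restricted_self_attention(q_img, k_img, v_img, q_vid, k_vid, v_vid):
    """Illustrative sketch only (assumed restriction, hypothetical names).

    Assumption: image-branch queries see only image keys/values, while
    video queries see the concatenation of video and image keys/values.
    Each argument is a 2-D array of shape (num_tokens, dim).
    """
    dim = q_img.shape[-1]

    def attend(q, k, v):
        # Standard scaled dot-product attention with a stable softmax.
        scores = q @ k.T / np.sqrt(dim)
        scores = scores - scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ v

    # Image tokens are isolated from the video: they attend to themselves only.
    out_img = attend(q_img, k_img, v_img)

    # Video tokens additionally read from the image tokens.
    k_all = np.concatenate([k_vid, k_img], axis=0)
    v_all = np.concatenate([v_vid, v_img], axis=0)
    out_vid = attend(q_vid, k_all, v_all)

    return out_img, out_vid
```

Under this assumption the image-branch output does not depend on the video tokens, so it could be computed once and reused, whereas the video output changes whenever the reference image features change.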
