FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

Abstract

Tuning-free approaches that adapt large-scale pre-trained video diffusion models for identity-preserving text-to-video generation (IPT2V) have recently gained popularity due to their efficacy and scalability. However, achieving satisfactory facial dynamics while keeping the identity unchanged remains a significant challenge. In this work, we present FantasyID, a novel tuning-free IPT2V framework that enhances the face knowledge of a pre-trained video model built on diffusion transformers (DiT). Essentially, a 3D facial geometry prior is incorporated to ensure plausible facial structures during video synthesis. To prevent the model from learning copy-paste shortcuts that simply replicate the reference face across frames, a multi-view face augmentation strategy is devised to capture diverse 2D facial appearance features, thereby increasing the dynamics of facial expressions and head poses. Additionally, after blending the 2D and 3D features as guidance, instead of naively employing cross-attention to inject guidance cues into the DiT layers, a learnable layer-aware adaptive mechanism selectively injects the fused features into each individual DiT layer, facilitating balanced modeling of identity preservation and motion dynamics. Experimental results validate our model's superiority over current tuning-free IPT2V methods.
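
To make the idea of layer-wise selective injection more concrete, the following is a minimal illustrative sketch, not the FantasyID implementation. It assumes a simple stand-in: the 2D and 3D features are blended by a convex mix, and each layer owns one learnable gate logit that scales how much of the fused guidance is added to its hidden states. The function names (`fuse`, `inject`, `run_layers`), the blending weight `alpha`, and the sigmoid gate are all hypothetical choices made for illustration; the abstract does not specify these details.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a gate logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(feat_2d, feat_3d, alpha=0.5):
    """Blend 2D appearance and 3D geometry features.

    Assumption: a plain convex mix stands in for whatever fusion
    module the actual model uses.
    """
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(feat_2d, feat_3d)]


def inject(hidden, guidance, gate_logit):
    """Add gated guidance to every token of one layer's hidden states."""
    g = sigmoid(gate_logit)
    return [[h + g * c for h, c in zip(token, guidance)] for token in hidden]


def run_layers(hidden, guidance, gate_logits):
    """Apply one injection per layer, each with its own gate logit.

    Assumption: the transformer block itself (attention and MLP) is
    omitted; only the per-layer injection step is shown.
    """
    for logit in gate_logits:
        hidden = inject(hidden, guidance, logit)
    return hidden


if __name__ == "__main__":
    guidance = fuse([1.0, 3.0], [3.0, 1.0])   # fused 2D + 3D guidance
    tokens = [[0.0, 0.0], [1.0, 1.0]]         # two tokens, width 2
    # Layer 1 is half open (logit 0), layer 2 is almost closed (logit -50).
    print(run_layers(tokens, guidance, [0.0, -50.0]))
```

In this toy setup, a layer whose gate logit is strongly negative passes its hidden states through nearly unchanged, while a layer with a larger logit receives more of the identity guidance. That per-layer freedom is the property the abstract attributes to the layer-aware mechanism, in contrast to injecting the same guidance uniformly into every layer.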
