

FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

February 19, 2025
作者: Yunpeng Zhang, Qiang Wang, Fan Jiang, Yaqi Fan, Mu Xu, Yonggang Qi
cs.AI

Abstract

Tuning-free approaches that adapt large-scale pre-trained video diffusion models for identity-preserving text-to-video generation (IPT2V) have recently gained popularity due to their efficacy and scalability. However, achieving satisfactory facial dynamics while keeping the identity unchanged remains a significant challenge. In this work, we present a novel tuning-free IPT2V framework, dubbed FantasyID, which enhances the face knowledge of a pre-trained video model built on diffusion transformers (DiT). Essentially, a 3D facial geometry prior is incorporated to ensure plausible facial structures during video synthesis. To prevent the model from learning a copy-paste shortcut that simply replicates the reference face across frames, a multi-view face augmentation strategy is devised to capture diverse 2D facial appearance features, thereby increasing the dynamics of facial expressions and head poses. Additionally, after blending the 2D and 3D features as guidance, instead of naively employing cross-attention to inject the guidance cues into the DiT layers, a learnable layer-aware adaptive mechanism selectively injects the fused features into each individual DiT layer, facilitating balanced modeling of identity preservation and motion dynamics. Experimental results validate our model's superiority over current tuning-free IPT2V methods.
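The layer-aware adaptive injection described above can be illustrated with a minimal numpy sketch. This is a hypothetical toy, not the paper's implementation: each DiT layer is given a learnable scalar gate that decides how strongly the fused 2D/3D face guidance is mixed into that layer's hidden states, standing in for the selective per-layer injection (the class and parameter names here are invented for illustration, and a simple gated residual replaces the paper's cross-attention pathway).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LayerAwareInjector:
    """Hypothetical sketch: one learnable gate per DiT layer controls
    how much fused 2D+3D face guidance is injected into that layer."""

    def __init__(self, num_layers, dim, seed=0):
        rng = np.random.default_rng(seed)
        # One learnable logit per layer; sigmoid keeps each gate in (0, 1).
        self.gate_logits = np.zeros(num_layers)
        # Small linear projection of the guidance feature (stand-in for
        # the real cross-attention-based injection).
        self.proj = rng.standard_normal((dim, dim)) * 0.02

    def inject(self, layer_idx, hidden, guidance):
        # hidden: (tokens, dim) layer activations
        # guidance: (dim,) fused 2D appearance + 3D geometry feature
        gate = sigmoid(self.gate_logits[layer_idx])
        # Broadcast the projected guidance to all tokens, scaled by the
        # layer-specific gate, so each layer receives a different dose.
        return hidden + gate * (guidance @ self.proj)

# Toy usage: 4 DiT layers, feature dimension 8.
inj = LayerAwareInjector(num_layers=4, dim=8)
h = np.zeros((3, 8))          # 3 tokens of zeroed hidden state
g = np.ones(8)                # dummy fused face feature
out = inj.inject(0, h, g)     # guidance injected into layer 0
```

During training the `gate_logits` would be optimized jointly with the rest of the network, letting shallow and deep layers settle on different injection strengths and thereby balance identity preservation against motion dynamics.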


PDF · February 24, 2025