FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
August 15, 2024
Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang, Dawei Leng, Yuhui Yin
cs.AI
Abstract
Synthesizing motion-rich and temporally consistent videos remains a challenge
in artificial intelligence, especially when dealing with extended durations.
Existing text-to-video (T2V) models commonly employ spatial cross-attention for
text control, equivalently guiding different frame generations without
frame-specific textual guidance. Thus, the model's capacity to comprehend the
temporal logic conveyed in prompts and generate videos with coherent motion is
restricted. To tackle this limitation, we introduce FancyVideo, an innovative
video generator that improves the existing text-control mechanism with the
well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM
incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner
(TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of
cross-attention, respectively, to achieve frame-specific textual guidance.
Firstly, TII injects frame-specific information from latent features into text
conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines
the correlation matrix between cross-frame textual conditions and latent
features along the time dimension. Lastly, TFB boosts the temporal consistency
of latent features. Extensive experiments comprising both quantitative and
qualitative evaluations demonstrate the effectiveness of FancyVideo. Our
approach achieves state-of-the-art T2V generation results on the EvalCrafter
benchmark and facilitates the synthesis of dynamic and consistent videos. The
video demonstrations are available at https://fancyvideo.github.io/, and we will make our code and model weights publicly available.
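The paper page itself ships no code, so the block below is a minimal, hypothetical PyTorch sketch of how a CTGM-style cross-attention block with TII, TAR, and TFB stages could be wired together. The class name `CrossFrameTextualGuidance`, the tensor shapes, and the concrete choices (single-head attention, a per-position frame-mixing convolution for TAR, a temporal convolution for TFB) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: all module names, shapes, and internals below are
# assumptions about how a CTGM-style cross-attention (TII -> TAR -> TFB)
# could be structured; they are not FancyVideo's released code.
import torch
import torch.nn as nn


class CrossFrameTextualGuidance(nn.Module):
    """Hypothetical CTGM block: frame-specific text guidance inside cross-attention."""

    def __init__(self, dim: int, text_dim: int, num_frames: int):
        super().__init__()
        self.scale = dim ** -0.5
        # TII (assumed): text tokens attend to per-frame latent features, producing
        # a different ("cross-frame") text condition for every frame.
        self.tii_q = nn.Linear(text_dim, dim)
        self.tii_kv = nn.Linear(dim, dim * 2)
        # Standard cross-attention projections: latents query the text conditions.
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        # TAR (assumed): refines the latent-text correlation matrix along the time
        # axis via a per-position mixing of frames (frames act as conv channels).
        self.tar = nn.Conv1d(num_frames, num_frames, kernel_size=1)
        # TFB (assumed): temporal convolution to boost consistency of the output latents.
        self.tfb = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.out = nn.Linear(dim, dim)

    def forward(self, latents: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # latents: (B, F, N, D) video latents; text: (B, L, D_text) prompt embeddings.
        B, Fr, N, D = latents.shape
        L = text.shape[1]

        # --- TII: inject frame-specific information into the text condition ---
        q_txt = self.tii_q(text)[:, None].expand(B, Fr, L, D)              # (B, F, L, D)
        k_lat, v_lat = self.tii_kv(latents).chunk(2, dim=-1)                # (B, F, N, D) each
        attn_t2l = torch.softmax(q_txt @ k_lat.transpose(-1, -2) * self.scale, dim=-1)
        cross_frame_text = q_txt + attn_t2l @ v_lat                         # per-frame text cond.

        # --- Cross-attention, with TAR refining the correlation matrix over time ---
        q = self.to_q(latents)                                              # (B, F, N, D)
        k, v = self.to_kv(cross_frame_text).chunk(2, dim=-1)                # (B, F, L, D) each
        corr = q @ k.transpose(-1, -2) * self.scale                         # (B, F, N, L)
        corr = self.tar(corr.reshape(B, Fr, N * L)).reshape(B, Fr, N, L)    # mix across frames
        out = torch.softmax(corr, dim=-1) @ v                               # (B, F, N, D)

        # --- TFB: temporal smoothing of the updated latents ---
        out = out.permute(0, 2, 3, 1).reshape(B * N, D, Fr)                 # (B*N, D, F)
        out = self.tfb(out).reshape(B, N, D, Fr).permute(0, 3, 1, 2)        # (B, F, N, D)
        return latents + self.out(out)


# Usage with assumed shapes: 16 frames, 1024 latent tokens per frame, 77 text tokens.
ctgm = CrossFrameTextualGuidance(dim=320, text_dim=768, num_frames=16)
video_latents = torch.randn(1, 16, 1024, 320)
prompt_tokens = torch.randn(1, 77, 768)
print(ctgm(video_latents, prompt_tokens).shape)  # torch.Size([1, 16, 1024, 320])
```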