VideoTetris: Towards Compositional Text-to-Video Generation

June 6, 2024
作者: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui
cs.AI

Abstract

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object counts. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we introduce spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we introduce an enhanced video data preprocessing pipeline that improves the training data with respect to motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Extensive experiments demonstrate that VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris
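
The abstract describes composing the denoising network's attention maps spatially and temporally, so that each object sub-prompt influences only its own (possibly time-varying) region of the video. Below is a minimal, hypothetical sketch of that idea in PyTorch; the function name, tensor layout, and mask-based composition scheme are illustrative assumptions, not the authors' actual implementation (see their repository for the real method).

```python
# Hypothetical sketch of spatio-temporal attention-map composition.
# Assumed layout: each attention map is [T, H, W, K] (frames, height,
# width, text tokens) and each region mask is [T, H, W]. These shapes
# and the normalization are illustrative choices, not VideoTetris's API.
import torch


def compose_attention(attn_maps: list[torch.Tensor],
                      masks: list[torch.Tensor]) -> torch.Tensor:
    """Compose per-sub-prompt cross-attention maps over space and time.

    attn_maps: one [T, H, W, K] attention map per object sub-prompt.
    masks:     one [T, H, W] soft/binary region mask per sub-prompt;
               masks may change across frames, so objects can appear,
               move, or disappear over time.
    """
    composed = torch.zeros_like(attn_maps[0])
    weight = torch.zeros_like(masks[0])
    for attn, mask in zip(attn_maps, masks):
        # Restrict each sub-prompt's attention to its assigned region.
        composed = composed + attn * mask.unsqueeze(-1)
        weight = weight + mask
    # Average where regions overlap; leave uncovered areas untouched.
    return composed / weight.clamp(min=1.0).unsqueeze(-1)


if __name__ == "__main__":
    # Toy usage: two sub-prompts over 4 frames of an 8x8 latent grid.
    T, H, W, K = 4, 8, 8, 16
    maps = [torch.rand(T, H, W, K) for _ in range(2)]
    left = torch.zeros(T, H, W)
    left[:, :, : W // 2] = 1.0          # first object occupies the left half
    right = 1.0 - left                   # second object occupies the right half
    out = compose_attention(maps, [left, right])
    print(out.shape)  # torch.Size([4, 8, 8, 16])
```

Because the masks are indexed by frame, the set of active regions can grow or shrink over time, which mirrors the dynamic-object-count scenario the abstract targets.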