Searching Priors Makes Text-to-Video Synthesis Better
June 5, 2024
Authors: Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu
cs.AI
Abstract
Significant advancements in video diffusion models have brought substantial
progress to the field of text-to-video (T2V) synthesis. However, existing T2V
synthesis models struggle to accurately generate complex motion dynamics,
leading to a reduction in video realism. One possible solution is to collect
massive data and train the model on it, but this would be extremely expensive.
To alleviate this problem, in this paper, we reformulate the typical T2V
generation process as a search-based generation pipeline. Instead of scaling up
the model training, we employ existing videos as the motion prior database.
Specifically, we divide the T2V generation process into two steps: (i) For a given
prompt input, we search existing text-video datasets to find videos with text
labels that closely match the prompt motions. We propose a tailored search
algorithm that emphasizes object motion features. (ii) Retrieved videos are
processed and distilled into motion priors to fine-tune a pre-trained base T2V
model, followed by generating the desired videos using the input prompt. By utilizing
the priors gleaned from the searched videos, we enhance the realism of the
generated videos' motion. All operations can be completed on a single NVIDIA RTX
4090 GPU. We validate our method against state-of-the-art T2V models across
diverse prompt inputs. The code will be made public.
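
To make the two-step pipeline concrete, below is a minimal sketch in Python of step (i), the prompt-to-video search. The sentence-transformers encoder, the caption database layout, and the plain cosine-similarity ranking are illustrative assumptions, not the authors' implementation; the paper's tailored search algorithm additionally emphasizes object-motion features, which this simple stand-in does not model.

```python
# Minimal sketch of the search-based T2V pipeline described in the abstract.
# Assumption: the motion-prior database is indexed by text captions, and a
# generic sentence encoder stands in for the paper's motion-aware search.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative text encoder


def search_motion_priors(prompt: str, captions: list[str], top_k: int = 3) -> list[int]:
    """Step (i): return indices of the captions closest to the prompt.

    Plain cosine similarity over whole-caption embeddings; the paper's
    tailored algorithm instead emphasizes object-motion features.
    """
    query = encoder.encode([prompt], normalize_embeddings=True)  # shape (1, d)
    keys = encoder.encode(captions, normalize_embeddings=True)   # shape (n, d)
    scores = keys @ query[0]                                     # cosine similarities, (n,)
    return np.argsort(-scores)[:top_k].tolist()


if __name__ == "__main__":
    database = [
        "a horse galloping across a field",
        "a man pouring coffee into a cup",
        "waves crashing on a rocky shore",
    ]
    hits = search_motion_priors("a stallion running through grass", database)
    print([database[i] for i in hits])
    # Step (ii), not shown: distill the retrieved clips into motion priors,
    # fine-tune the pre-trained base T2V model on them, then sample the
    # final video from the fine-tuned model with the original prompt.
```

Any retrieval backend with a motion-sensitive text matcher could replace the ranking function above; the key design choice is substituting search over existing videos for large-scale training data collection.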