I4VGen: Image as Stepping Stone for Text-to-Video Generation
June 4, 2024
Authors: Xiefan Guo, Jinlin Liu, Miaomiao Cui, Di Huang
cs.AI
Abstract
Text-to-video generation has lagged behind text-to-image synthesis in quality
and diversity due to the complexity of spatio-temporal modeling and limited
video-text datasets. This paper presents I4VGen, a training-free and
plug-and-play video diffusion inference framework, which enhances text-to-video
generation by leveraging robust image techniques. Specifically, following
text-to-image-to-video, I4VGen decomposes the text-to-video generation into two
stages: anchor image synthesis and anchor image-guided video synthesis.
Correspondingly, a well-designed generation-selection pipeline is employed to
achieve visually-realistic and semantically-faithful anchor image, and an
innovative Noise-Invariant Video Score Distillation Sampling is incorporated to
animate the image to a dynamic video, followed by a video regeneration process
to refine the video. This inference strategy effectively mitigates the
prevalent issue of non-zero terminal signal-to-noise ratio. Extensive
evaluations show that I4VGen not only produces videos with higher visual
realism and textual fidelity but also integrates seamlessly into existing
image-to-video diffusion models, thereby improving overall video quality.
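The two-stage decomposition described above can be sketched in code. The following is a minimal structural illustration, not the authors' implementation: every function below (candidate generation, the anchor scorer, the animation and regeneration steps) is a hypothetical stand-in for the diffusion-model components the abstract names.

```python
import random

def generate_anchor_candidates(prompt, n=4):
    # Stand-in for a text-to-image diffusion model sampling n candidates.
    return [f"image({prompt}, seed={i})" for i in range(n)]

def score_anchor(image, prompt):
    # Stand-in for a scorer of visual realism and text faithfulness
    # (e.g., a CLIP-style similarity model); random here for illustration.
    return random.random()

def select_anchor(prompt, n=4):
    # Generation-selection pipeline: sample candidates, keep the best-scoring.
    candidates = generate_anchor_candidates(prompt, n)
    return max(candidates, key=lambda img: score_anchor(img, prompt))

def animate(anchor, prompt, num_frames=16):
    # Stand-in for Noise-Invariant Video Score Distillation Sampling,
    # which lifts the static anchor image to a dynamic video.
    return [f"frame{t}:{anchor}" for t in range(num_frames)]

def regenerate(video, prompt):
    # Stand-in for the video regeneration pass that refines the clip.
    return video

def i4vgen(prompt):
    anchor = select_anchor(prompt)    # stage 1: anchor image synthesis
    video = animate(anchor, prompt)   # stage 2: anchor image-guided synthesis
    return regenerate(video, prompt)
```

Because the framework is training-free and plug-and-play, each stand-in above would be replaced by an existing pretrained image or video diffusion model at inference time.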