
I4VGen: Image as Stepping Stone for Text-to-Video Generation

June 4, 2024
Authors: Xiefan Guo, Jinlin Liu, Miaomiao Cui, Di Huang
cs.AI

Abstract

Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity due to the complexity of spatio-temporal modeling and the limited size of video-text datasets. This paper presents I4VGen, a training-free and plug-and-play video diffusion inference framework that enhances text-to-video generation by leveraging robust image techniques. Specifically, following a text-to-image-to-video paradigm, I4VGen decomposes text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. Correspondingly, a well-designed generation-selection pipeline is employed to obtain a visually realistic and semantically faithful anchor image, and an innovative Noise-Invariant Video Score Distillation Sampling is incorporated to animate the image into a dynamic video, followed by a video regeneration process to refine the result. This inference strategy effectively mitigates the prevalent issue of non-zero terminal signal-to-noise ratio. Extensive evaluations show that I4VGen not only produces videos with higher visual realism and textual fidelity but also integrates seamlessly into existing image-to-video diffusion models, thereby improving overall video quality.
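The two-stage inference strategy described in the abstract can be sketched as follows. This is a minimal, hedged illustration only: `text_to_image`, `score_image`, and `predict_noise` are hypothetical stubs standing in for real diffusion-model components (they are not part of the paper's released code), and the update loop shows only the structural idea of the method, namely that the noise sample in the score-distillation loop is drawn once and reused, and that a regeneration pass refines the animated video.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stubs standing in for real diffusion components ---
def text_to_image(prompt):
    """Stand-in for a text-to-image diffusion model (returns a toy 8x8 RGB array)."""
    return rng.standard_normal((8, 8, 3))

def score_image(image, prompt):
    """Stand-in for a text-image alignment / reward scorer."""
    return float(image.mean())

def predict_noise(noisy_video):
    """Stand-in for a video diffusion model's noise prediction."""
    return 0.1 * noisy_video

# --- Stage 1: anchor image synthesis via generation-selection ---
def generation_selection(prompt, n_candidates=4):
    """Synthesize several anchor-image candidates and keep the best-scoring one."""
    candidates = [text_to_image(prompt) for _ in range(n_candidates)]
    scores = [score_image(c, prompt) for c in candidates]
    return candidates[int(np.argmax(scores))]

# --- Stage 2: anchor image-guided video synthesis ---
def noise_invariant_sds(anchor, prompt, frames=16, steps=20, lr=0.05):
    """Animate the anchor image; the noise is sampled ONCE and reused every step."""
    video = np.repeat(anchor[None], frames, axis=0)  # static video initialization
    noise = rng.standard_normal(video.shape)          # noise-invariant: fixed sample
    for _ in range(steps):
        noisy = video + noise
        grad = predict_noise(noisy) - noise           # SDS-style gradient signal
        video = video - lr * grad
    return regenerate(video)

def regenerate(video):
    """Video regeneration: re-noise and denoise once to refine (SDEdit-style stub)."""
    noisy = video + 0.5 * rng.standard_normal(video.shape)
    return noisy - predict_noise(noisy)
```

Because the framework is training-free, both stages operate purely at inference time around frozen image and video diffusion models, which is what makes it plug-and-play with existing image-to-video pipelines.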