
Searching Priors Makes Text-to-Video Synthesis Better

June 5, 2024
Authors: Haoran Cheng, Liang Peng, Linxuan Xia, Yuepeng Hu, Hengjia Li, Qinglin Lu, Xiaofei He, Boxi Wu
cs.AI

Abstract

Significant advancements in video diffusion models have brought substantial progress to the field of text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, leading to a reduction in video realism. One possible solution is to collect massive amounts of data and train the model on them, but this would be extremely expensive. To alleviate this problem, in this paper we reformulate the typical T2V generation process as a search-based generation pipeline. Instead of scaling up model training, we employ existing videos as a motion prior database. Specifically, we divide the T2V generation process into two steps: (i) For a given prompt input, we search existing text-video datasets to find videos with text labels that closely match the prompt's motions. We propose a tailored search algorithm that emphasizes object motion features. (ii) The retrieved videos are processed and distilled into motion priors to fine-tune a pre-trained base T2V model, followed by generating the desired videos using the input prompt. By utilizing the priors gleaned from the searched videos, we enhance the realism of the generated videos' motion. All operations can be completed on a single NVIDIA RTX 4090 GPU. We validate our method against state-of-the-art T2V models across diverse prompt inputs. The code will be made public.
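The two-step pipeline described above can be pictured with a short sketch. The snippet below is a minimal, illustrative rendering of the idea and not the authors' released code: the helper names (embed_motion_text, search_motion_priors, finetune_and_generate) are hypothetical, and a toy bag-of-words embedder stands in for the paper's tailored, motion-aware search algorithm.

```python
# Minimal sketch of the search-based T2V pipeline from the abstract.
# All function names below are hypothetical placeholders, not the authors' API.

from typing import List, Tuple
import numpy as np

def embed_motion_text(text: str) -> np.ndarray:
    """Hypothetical embedder emphasizing motion words (verbs).
    A toy bag-of-words vector stands in for a real text encoder."""
    motion_vocab = ["run", "jump", "swim", "walk", "spin", "fall"]
    return np.array([float(w in text.lower()) for w in motion_vocab])

def search_motion_priors(prompt: str,
                         dataset: List[Tuple[str, str]],
                         top_k: int = 3) -> List[str]:
    """Step (i): rank captioned videos by motion similarity to the prompt
    and return the paths of the closest matches."""
    q = embed_motion_text(prompt)
    scored = []
    for caption, video_path in dataset:
        v = embed_motion_text(caption)
        denom = (np.linalg.norm(q) * np.linalg.norm(v)) or 1.0
        scored.append((float(q @ v) / denom, video_path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]

def finetune_and_generate(prompt: str, prior_videos: List[str]) -> str:
    """Step (ii): distill retrieved clips into motion priors, fine-tune a
    pre-trained base T2V model on them, then sample a video for the prompt.
    Shown here as a stub; the real step is diffusion fine-tuning."""
    print(f"Fine-tuning base T2V model on {len(prior_videos)} prior clips...")
    return f"generated_{prompt.replace(' ', '_')}.mp4"

if __name__ == "__main__":
    dataset = [("a dog runs across a field", "videos/dog_run.mp4"),
               ("a bird glides over the sea", "videos/bird_glide.mp4"),
               ("a child jumps on a trampoline", "videos/child_jump.mp4")]
    prompt = "a horse runs along the beach"
    priors = search_motion_priors(prompt, dataset)
    print(finetune_and_generate(prompt, priors))
```

In the actual method, step (ii) corresponds to fine-tuning the base diffusion model on the retrieved clips before sampling, which the abstract notes can be completed on a single NVIDIA RTX 4090 GPU.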
