

UniVG: Towards UNIfied-modal Video Generation

January 17, 2024
Authors: Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao
cs.AI

Abstract

Diffusion-based video generation has received extensive attention and achieved considerable success in both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input image and text conditions flexibly, either individually or in combination. To address this, we propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across text and image modalities. To this end, we revisit the various video generation tasks within our system from the perspective of generative freedom and classify them into high-freedom and low-freedom video generation categories. For high-freedom video generation, we employ Multi-condition Cross Attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace the purely random Gaussian noise, which helps to better preserve the content of the input conditions. Our method achieves the lowest Fréchet Video Distance (FVD) on the public academic benchmark MSR-VTT, surpasses the current open-source methods in human evaluations, and is on par with the current closed-source method Gen2. For more samples, visit https://univg-baidu.github.io.
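The abstract names two mechanisms without detailing them; the sketches below are hypothetical illustrations of how such components are commonly realised in diffusion pipelines, not the authors' implementation. First, a minimal Multi-condition Cross Attention layer, assuming text and image embeddings are simply concatenated along the sequence axis to form a joint key/value context for the video latents (all tensor names and shapes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class MultiConditionCrossAttention(nn.Module):
    """Illustrative sketch (assumed design, not the paper's code):
    text tokens and image tokens are concatenated along the sequence
    axis and jointly attended to by the video latent tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, text_tokens, image_tokens):
        # video_tokens : (B, N_v, dim)  latent video features (queries)
        # text_tokens  : (B, N_t, dim)  text-encoder embeddings
        # image_tokens : (B, N_i, dim)  image-encoder embeddings
        context = torch.cat([text_tokens, image_tokens], dim=1)  # joint keys/values
        out, _ = self.attn(video_tokens, context, context)
        return out
```

Second, a sketch of Biased Gaussian Noise for low-freedom tasks, assuming the bias comes from partially noising the latent of the conditioning input (a standard DDPM-style forward step at the final timestep) so that sampling starts near the content to be preserved; the paper's exact formulation may differ:

```python
import torch

def biased_gaussian_noise(cond_latent: torch.Tensor,
                          alpha_bar_T: float) -> torch.Tensor:
    """Illustrative sketch: bias the sampler's initial noise toward the
    latent of the conditioning image instead of drawing purely random
    Gaussian noise.

    cond_latent : latent of the conditioning image/frame, (B, C, H, W)
                  -- hypothetical input
    alpha_bar_T : cumulative noise-schedule value at the final timestep
    """
    eps = torch.randn_like(cond_latent)  # purely random Gaussian noise
    return (alpha_bar_T ** 0.5) * cond_latent + ((1.0 - alpha_bar_T) ** 0.5) * eps

# In such a setup, this would replace the usual torch.randn(...) initialisation
# for low-freedom tasks (e.g. animating a given image) so the denoised video
# stays close to the input condition.
```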