
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

January 17, 2024
Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan
cs.AI

Abstract

Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
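To make the described training scheme concrete, below is a minimal, hypothetical PyTorch sketch of the two-stage recipe the abstract outlines: first train spatial and temporal modules jointly on (low-quality) video, then freeze the temporal modules and finetune only the spatial modules on high-quality single-frame images. All names (TinyVideoUNet, SpatialBlock, TemporalBlock), tensor shapes, and the loss are illustrative placeholders under these assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage scheme; names and loss are placeholders.
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Per-frame (image-level) convolution, shared across the time axis."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        y = self.conv(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        return y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

class TemporalBlock(nn.Module):
    """1-D convolution along the time axis, standing in for motion modules."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        return self.conv(x)

class TinyVideoUNet(nn.Module):
    """Toy stand-in for a Stable Diffusion-derived video UNet."""
    def __init__(self, channels=8):
        super().__init__()
        self.spatial = SpatialBlock(channels)
        self.temporal = TemporalBlock(channels)

    def forward(self, x):
        return self.temporal(self.spatial(x))

model = TinyVideoUNet()

# Stage 1: full training of spatial + temporal modules on low-quality videos,
# which, per the abstract, couples the two kinds of modules more strongly.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
lq_video = torch.randn(2, 8, 16, 32, 32)               # dummy (B, C, T, H, W) batch
loss = model(lq_video).pow(2).mean()                    # placeholder objective, not the real diffusion loss
loss.backward(); opt.step(); opt.zero_grad()

# Stage 2: freeze temporal modules and finetune only spatial modules on
# high-quality images (T = 1), shifting appearance quality without retraining motion.
for p in model.temporal.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(model.spatial.parameters(), lr=1e-5)
hq_image = torch.randn(2, 8, 1, 32, 32)                 # single-frame "image" batch
loss = model(hq_image).pow(2).mean()
loss.backward(); opt.step(); opt.zero_grad()
```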