

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

January 17, 2024
Authors: Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan
cs.AI

Abstract

Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
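The two-stage recipe outlined above can be illustrated with a short sketch: first train all modules jointly on low-quality videos to obtain a strong spatial-temporal coupling, then freeze the temporal modules and finetune only the spatial modules on high-quality images. This is a minimal illustration in PyTorch, not the authors' released code; the assumption that temporal layers can be identified by a "temporal" substring in their parameter names is hypothetical.

```python
# Minimal sketch of the two-stage training scheme described in the abstract.
# Assumptions (not from the paper's code): temporal layers carry "temporal"
# in their parameter names, and AdamW with lr=1e-5 is used for both stages.
import torch
from torch import nn


def split_params(unet: nn.Module):
    """Partition parameters into spatial vs. temporal groups by module name."""
    spatial, temporal = [], []
    for name, param in unet.named_parameters():
        # Assumed naming convention for temporal layers in an SD-based video UNet.
        (temporal if "temporal" in name else spatial).append(param)
    return spatial, temporal


def stage1_full_training(unet: nn.Module, lr: float = 1e-5):
    """Stage 1: train all modules jointly on low-quality videos (e.g. WebVid-10M),
    which couples the spatial and temporal modules more strongly than training
    the temporal modules alone."""
    for p in unet.parameters():
        p.requires_grad_(True)
    return torch.optim.AdamW(unet.parameters(), lr=lr)


def stage2_spatial_finetune(unet: nn.Module, lr: float = 1e-5):
    """Stage 2: freeze the temporal modules and finetune only the spatial modules
    on high-quality (synthesized) images, shifting the appearance distribution
    to higher quality without degrading motion."""
    spatial, temporal = split_params(unet)
    for p in temporal:
        p.requires_grad_(False)
    for p in spatial:
        p.requires_grad_(True)
    return torch.optim.AdamW(spatial, lr=lr)
```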