UniVG: Towards UNIfied-modal Video Generation
January 17, 2024
Authors: Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao
cs.AI
Abstract
Diffusion-based video generation has received extensive attention and
achieved considerable success within both the academic and industrial
communities. However, current efforts are mainly concentrated on
single-objective or single-task video generation, such as generation driven by
text, by image, or by a combination of text and image. This cannot fully meet
the needs of real-world application scenarios, as users are likely to input
images and text conditions in a flexible manner, either individually or in
combination. To address this, we propose a Unified-modal Video Generation
system that is capable of handling multiple video generation tasks across text
and image modalities. To this end, we revisit the various video generation
tasks within our system from the perspective of generative freedom, and
classify them into high-freedom and low-freedom video generation categories.
For high-freedom video generation, we employ Multi-condition Cross Attention to
generate videos that align with the semantics of the input images or text. For
low-freedom video generation, we introduce Biased Gaussian Noise to replace the
purely random Gaussian noise, which helps to better preserve the content of the
input conditions (both mechanisms are sketched below). Our method achieves the
lowest Fréchet Video Distance (FVD)
on the public academic benchmark MSR-VTT, surpasses the current open-source
methods in human evaluations, and is on par with the current closed-source
method Gen2. For more samples, visit https://univg-baidu.github.io.
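
The abstract names Multi-condition Cross Attention only at a high level. Below is a minimal sketch of one plausible formulation, assuming the text and image embeddings are concatenated into a single condition sequence that the video latent tokens attend to; the module name, dimensions, and residual wiring are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of multi-condition cross attention (assumed formulation):
# text and image embeddings are concatenated along the sequence axis and
# attended to jointly by the video latent tokens. Names are illustrative.
import torch
import torch.nn as nn

class MultiConditionCrossAttention(nn.Module):
    def __init__(self, latent_dim: int, cond_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, num_heads=num_heads,
            kdim=cond_dim, vdim=cond_dim, batch_first=True)

    def forward(self, video_tokens, text_emb, image_emb):
        # video_tokens: (B, N_latent, latent_dim) flattened video latents
        # text_emb:     (B, N_text, cond_dim), e.g. CLIP text features
        # image_emb:    (B, N_img,  cond_dim), e.g. CLIP image features
        cond = torch.cat([text_emb, image_emb], dim=1)  # joint condition sequence
        out, _ = self.attn(query=video_tokens, key=cond, value=cond)
        return video_tokens + out  # residual connection

if __name__ == "__main__":
    attn = MultiConditionCrossAttention(latent_dim=320, cond_dim=768)
    z = torch.randn(2, 1024, 320)   # video latent tokens
    t = torch.randn(2, 77, 768)     # text condition
    i = torch.randn(2, 257, 768)    # image condition
    print(attn(z, t, i).shape)      # torch.Size([2, 1024, 320])
```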
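Similarly, Biased Gaussian Noise is only named, not defined, in the abstract. The sketch below shows one common way such a bias could work: mixing a small amount of the conditioning input's latent into the otherwise random starting noise of the reverse diffusion process. The mixing rule and the `alpha_bar_T` value are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of a "biased" initial noise for low-freedom generation, under the
# assumption that it nudges the starting latent x_T toward the latent of the
# conditioning image instead of sampling pure N(0, I).
import torch

def biased_gaussian_noise(cond_latent: torch.Tensor,
                          alpha_bar_T: float = 0.002) -> torch.Tensor:
    """cond_latent: (B, C, T, H, W) latent of the conditioning image/video,
    broadcast over the temporal axis. Returns an initial latent x_T that
    still looks like Gaussian noise but retains a trace of the condition."""
    eps = torch.randn_like(cond_latent)  # pure random Gaussian noise
    return (alpha_bar_T ** 0.5) * cond_latent + ((1.0 - alpha_bar_T) ** 0.5) * eps
```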