VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
March 8, 2024
Authors: Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo
cs.AI
Abstract
Text-to-image diffusion models (T2I) have demonstrated unprecedented
capabilities in creating realistic and aesthetic images. On the contrary,
text-to-video diffusion models (T2V) still lag far behind in frame quality and
text alignment, owing to insufficient quality and quantity of training videos.
In this paper, we introduce VideoElevator, a training-free and plug-and-play
method, which elevates the performance of T2V using superior capabilities of
T2I. Different from conventional T2V sampling (i.e., temporal and spatial
modeling), VideoElevator explicitly decomposes each sampling step into temporal
motion refining and spatial quality elevating. Specifically, temporal motion
refining uses encapsulated T2V to enhance temporal consistency, followed by
inverting to the noise distribution required by T2I. Then, spatial quality
elevating harnesses the inflated T2I to directly predict a less noisy latent,
adding more photo-realistic details. We have conducted experiments on extensive
prompts under various combinations of T2V and T2I. The results show that
VideoElevator not only improves the performance of T2V baselines with
foundational T2I, but also facilitates stylistic video synthesis with
personalized T2I. Our code is available at
https://github.com/YBYBZhang/VideoElevator.
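
The decomposed sampling loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the real method operates on diffusion latents with proper noise schedules and DDIM-style inversion, while here `t2v_step`, `add_noise`, and `t2i_step` are hypothetical stand-in callables used only to show the control flow of one sampling step (temporal refining, then re-noising, then spatial elevating).

```python
import numpy as np

def videoelevator_sample(latents, t2v_step, t2i_step, add_noise, timesteps):
    """Sketch of the decomposed sampling loop: each step is split into
    temporal motion refining and spatial quality elevating."""
    for t in timesteps:
        # 1) Temporal motion refining: the encapsulated T2V enhances
        #    temporal consistency across frames at the current step.
        refined = t2v_step(latents, t)
        # 2) Invert (re-noise) the refined latent to the noise
        #    distribution the T2I model expects at this step.
        noised = add_noise(refined, t)
        # 3) Spatial quality elevating: the inflated T2I directly
        #    predicts a less noisy latent with more spatial detail.
        latents = t2i_step(noised, t)
    return latents

# Toy stand-ins: 4 frames of 8x8 single-channel latents.
rng = np.random.default_rng(0)
frames = rng.standard_normal((4, 8, 8))

t2v_step = lambda x, t: 0.5 * (x + x.mean(axis=0))          # pull frames toward each other
add_noise = lambda x, t: x + 0.1 * t * rng.standard_normal(x.shape)
t2i_step = lambda x, t: 0.9 * x                             # shrink noise per frame

out = videoelevator_sample(frames, t2v_step, t2i_step, add_noise, timesteps=[3, 2, 1])
print(out.shape)  # → (4, 8, 8)
```

The key design point visible even in this toy form is that the two models never act simultaneously: T2V touches the latent only for cross-frame consistency, and T2I only for per-frame quality, with a re-noising step bridging their noise distributions.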