SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Abstract

Video generation has recently made substantial progress toward realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents SEINE, a short-to-long video diffusion model that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and with shot-level videos of varying lengths. Specifically, we propose a random-mask video diffusion model that automatically generates transitions based on textual descriptions. Given images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To comprehensively evaluate this new generative task, we propose three criteria for assessing smooth and creative transitions: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/
