SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
October 31, 2023
Authors: Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu
cs.AI
Abstract
Recently, video generation has achieved substantial progress with realistic
results. Nevertheless, existing AI-generated videos are usually very short
clips ("shot-level") depicting a single scene. To deliver a coherent long video
("story-level"), it is desirable to have creative transition and prediction
effects across different clips. This paper presents a short-to-long video
diffusion model, SEINE, that focuses on generative transition and prediction.
The goal is to generate high-quality long videos with smooth and creative
transitions between scenes and varying lengths of shot-level videos.
Specifically, we propose a random-mask video diffusion model to automatically
generate transitions based on textual descriptions. By providing the images of
different scenes as inputs, combined with text-based control, our model
generates transition videos that ensure coherence and visual quality.
Furthermore, the model can be readily extended to various tasks such as
image-to-video animation and autoregressive video prediction. To conduct a
comprehensive evaluation of this new generative task, we propose three
assessment criteria for smooth and creative transitions: temporal consistency,
semantic similarity, and video-text semantic alignment. Extensive experiments
validate the effectiveness of our approach over existing methods for generative
transition and prediction, enabling the creation of story-level long videos.
Project page: https://vchitect.github.io/SEINE-project/.
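
To make the core idea concrete, below is a minimal, hypothetical sketch of the masked-frame conditioning described in the abstract: the last frame of one scene and the first frame of the next are kept visible, all intermediate frames are masked out, and the diffusion model fills in the transition under text guidance. Function names, tensor shapes, and the conditioning interface are illustrative assumptions, not the authors' released implementation.

```python
import torch

def build_transition_condition(scene_a_frame, scene_b_frame, num_frames=16):
    """Assemble the conditioning pair (masked video, mask) for a transition.

    scene_a_frame, scene_b_frame: (C, H, W) tensors in [-1, 1], taken from the
    two shot-level clips that the transition should connect. Shapes and the
    frame count are illustrative assumptions.
    """
    c, h, w = scene_a_frame.shape
    video = torch.zeros(num_frames, c, h, w)      # masked frames are zeroed out
    mask = torch.zeros(num_frames, 1, h, w)       # 1 = observed frame, 0 = to generate
    video[0] = scene_a_frame                      # anchor the start on scene A
    video[-1] = scene_b_frame                     # anchor the end on scene B
    mask[0] = 1.0
    mask[-1] = 1.0
    # In a masked video diffusion model of this kind, the masked video and the
    # mask are fed to the denoiser together with the noisy latents and a text
    # prompt describing the desired transition; only the masked frames are
    # synthesized.
    return video, mask

# Per the abstract, training uses random masking (an arbitrary subset of frames
# kept visible), which is what lets the same model also handle image-to-video
# animation (keep only the first frame) and autoregressive video prediction
# (keep the trailing frames of the previous clip).
```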