SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Abstract

Recently, video generation has made substantial progress and now produces realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents SEINE, a short-to-long video diffusion model that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and with shot-level videos of varying lengths. Specifically, we propose a random-mask video diffusion model that automatically generates transitions based on textual descriptions. Given images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To evaluate this new generative task comprehensively, we propose three criteria for assessing smooth and creative transitions: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/
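
The abstract names a random-mask video diffusion model but does not spell out how the mask is laid out. As a purely illustrative sketch, assume a per-frame binary mask in which 1 marks a frame supplied as a condition and 0 marks a frame the model must generate. Under that assumption, the three tasks mentioned above (transition, image-to-video animation, and video prediction) differ only in which frames are marked as known. The function names below are hypothetical and are not taken from the SEINE code.

```python
from typing import List


def frame_mask(num_frames: int, known: List[int]) -> List[int]:
    """Build a 0/1 mask over frames (assumed convention: 1 = given, 0 = to generate)."""
    mask = [0] * num_frames
    for idx in known:
        if not 0 <= idx < num_frames:
            raise IndexError(f"frame index {idx} outside clip of {num_frames} frames")
        mask[idx] = 1
    return mask


def transition_mask(num_frames: int) -> List[int]:
    """Transition: first and last frames come from two different scenes."""
    return frame_mask(num_frames, [0, num_frames - 1])


def animation_mask(num_frames: int) -> List[int]:
    """Image-to-video animation: only the first frame is given."""
    return frame_mask(num_frames, [0])


def prediction_mask(num_frames: int, context: int) -> List[int]:
    """Video prediction: the first `context` frames are given, the rest are generated."""
    return frame_mask(num_frames, list(range(context)))


if __name__ == "__main__":
    print(transition_mask(5))     # first and last frame known
    print(animation_mask(4))      # only the first frame known
    print(prediction_mask(6, 2))  # two context frames known
```

In an autoregressive setting, one plausible use of such a mask is to feed the last few generated frames of one clip back in as the known context frames of the next clip, which is how a sequence of short clips could be chained into a longer video.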
