I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Abstract

Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models. However, it still faces challenges in semantic accuracy, clarity, and spatio-temporal continuity. These challenges arise primarily from the scarcity of well-aligned text-video data and the complex inherent structure of videos, which make it difficult for a model to ensure semantic and qualitative excellence at the same time. In this report, we propose a cascaded approach, I2VGen-XL, that improves model performance by decoupling these two factors and ensures the alignment of the input data by using static images as a crucial form of guidance. I2VGen-XL consists of two stages: i) the base stage uses two hierarchical encoders to guarantee coherent semantics and preserve the content of the input image, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and raises the resolution to 1280×720. To improve diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. In this way, I2VGen-XL simultaneously improves the semantic accuracy, continuity of details, and clarity of the generated videos. Through extensive experiments, we investigate the underlying principles of I2VGen-XL and compare it with current top methods, demonstrating its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.
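The two-stage cascade described in the abstract can be pictured with a small structural sketch. This is not the authors' implementation: it runs no diffusion model and only tracks which conditioning signals and which resolution each stage works with. The names `Clip`, `base_stage`, `refinement_stage`, and `generate`, the 16-frame default, and the 320×180 base resolution are all hypothetical placeholders; only the 1280×720 output size, the image-only conditioning of the base stage, and the added brief text in the refinement stage come from the abstract.

```python
from dataclasses import dataclass

# Output resolution of the refinement stage, as stated in the abstract.
REFINED_SIZE = (1280, 720)


@dataclass(frozen=True)
class Clip:
    """Stand-in for a generated video: shape metadata only, no pixel data."""
    width: int
    height: int
    num_frames: int
    conditioned_on: tuple


def base_stage(image_path, num_frames=16, base_size=(320, 180)):
    """Stage i: low-resolution clip conditioned only on the input image.

    The two conditioning entries mirror the "two hierarchical encoders"
    (semantics and content). num_frames and base_size are assumed
    placeholders, not values from the source.
    """
    width, height = base_size
    return Clip(width, height, num_frames,
                ("image_semantics", "image_content"))


def refinement_stage(clip, brief_text):
    """Stage ii: add a brief text condition and raise the resolution."""
    if not brief_text:
        raise ValueError("the refinement stage expects a brief text prompt")
    width, height = REFINED_SIZE
    return Clip(width, height, clip.num_frames,
                clip.conditioned_on + ("brief_text",))


def generate(image_path, brief_text):
    """Run the cascade: base stage first, then refinement."""
    return refinement_stage(base_stage(image_path), brief_text)


if __name__ == "__main__":
    print(generate("input.png", "a dog running on a beach"))
```

The point of the split is visible in the types: the base stage never sees text, so semantic consistency with the input image is settled before the refinement stage spends its capacity on detail and resolution.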
