Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Abstract

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.
