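FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Abstract

This paper proposes FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on the following key components: 1) a new video autoencoder with a high compression ratio (64×64×4 spatial-temporal downsampling ratio) that achieves excellent reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new memory design that enhances inter-layer information flow and context reuse; and 3) a multi-resolution generation strategy that uses a few-step DIT upsampler to improve video fidelity. The final model, consisting of a 14B-parameter DIT base model and a 14B-parameter DIT upsampler, achieves performance competitive with other leading open-source models while running an order of magnitude faster. This report also discusses the model design and training strategy.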

English

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; 2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and 3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design as well as our training strategies in this report.
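To give a rough sense of what a 64×64×4 spatial-temporal downsampling ratio means for the size of the latent grid, the arithmetic can be sketched as below. This is an illustrative calculation only, not the authors' code. The helper names, the example clip size, the ceil-division rounding, and the 8×8×4 comparison ratio are all assumptions introduced here. The abstract does not state the latent channel count, any patchification, or how the first frame is handled, so none of those are modeled.

```python
import math


def latent_grid(frames, height, width, ds=(4, 64, 64)):
    """Return the (T, H, W) latent grid size for a clip.

    ds is the (temporal, height, width) downsampling ratio. Ceil division
    is an assumption; the real autoencoder may pad or crop differently.
    """
    dt, dh, dw = ds
    return (
        math.ceil(frames / dt),
        math.ceil(height / dh),
        math.ceil(width / dw),
    )


def num_positions(grid):
    """Number of spatio-temporal positions in a latent grid."""
    t, h, w = grid
    return t * h * w


# Hypothetical clip: 120 frames at 1024x1024 pixels.
clip = (120, 1024, 1024)

# Ratio stated in the abstract: 64x64 spatial, 4 temporal.
fs_grid = latent_grid(*clip)

# Assumed comparison point (not from the source): 8x8 spatial, 4 temporal.
baseline_grid = latent_grid(*clip, ds=(4, 8, 8))

ratio = num_positions(baseline_grid) / num_positions(fs_grid)

print("64x64x4 latent grid:", fs_grid)
print("8x8x4 latent grid:  ", baseline_grid)
print("position-count ratio:", ratio)
```

Under these assumptions the highly compressed grid has 64 times fewer positions than the 8×8×4 grid for the same clip. Because attention cost grows with the number of positions a transformer processes, a reduction of this kind is one plausible contributor to the speedup the abstract reports, although the abstract itself does not break down where the speedup comes from.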

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space
