LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to videos of only 10-20 seconds. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant, quadratic-complexity self-attention block with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (75.6% win rate) while reducing FLOPs by up to 15× and latency by up to 11.5×. Furthermore, both automatic metrics and human evaluation show that our LinGen-4B yields video quality comparable to state-of-the-art models (win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
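
The scaling argument in the abstract can be illustrated with a small sketch. This is not LinGen's cost model: the function name `relative_cost`, the baseline token count, and the choice to ignore constant factors are all assumptions made here for illustration. It only shows how a cost that grows quadratically in the token count diverges from one that grows linearly as a video gets longer or larger.

```python
def relative_cost(num_tokens: int, base_tokens: int, exponent: int) -> float:
    """Cost relative to a baseline, assuming cost ~ num_tokens ** exponent.

    exponent=2 models quadratic growth (as with full self-attention);
    exponent=1 models linear growth. Constant factors are ignored.
    """
    return (num_tokens / base_tokens) ** exponent


if __name__ == "__main__":
    base = 1_000  # hypothetical baseline token count, not taken from the paper
    for scale in (1, 2, 4, 8):
        n = base * scale
        quadratic = relative_cost(n, base, 2)
        linear = relative_cost(n, base, 1)
        print(f"{scale}x tokens: quadratic {quadratic:.0f}x, linear {linear:.0f}x")
```

Under these assumptions, the gap between the two curves itself grows in proportion to the token count, which is why the savings matter most for minute-length, high-resolution videos.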
