UltraGen: High-Resolution Video Generation with Hierarchical Attention

Abstract

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion-transformer-based video generation models are limited to low-resolution outputs (≤720P) because the computational cost of the attention mechanism grows quadratically with the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can, for the first time, effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution, outperforming existing state-of-the-art methods and super-resolution-based two-stage pipelines in both qualitative and quantitative evaluations.
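
To make the quadratic bottleneck concrete, the following is a minimal arithmetic sketch, not the UltraGen implementation. It counts spatial tokens per frame and compares the number of query-key pairs scored by full attention at each resolution. The 8x VAE downsampling, the 2x2 patch size, and all function names are illustrative assumptions that do not come from the abstract.

```python
# Illustrative arithmetic only: the 8x spatial VAE downsampling and 2x2 patch
# size below are assumptions, not values taken from the UltraGen paper.

def tokens_per_frame(width, height, vae_stride=8, patch=2):
    """Number of spatial tokens for one latent frame (assumed tokenizer)."""
    cell = vae_stride * patch  # pixels covered by one token along each axis
    return (width // cell) * (height // cell)

def full_attention_pairs(n_tokens):
    """Full self-attention scores every token against every token."""
    return n_tokens * n_tokens

def windowed_attention_pairs(n_tokens, window_tokens):
    """Local attention: each token attends only to tokens in its own window."""
    return n_tokens * window_tokens

RESOLUTIONS = {"720P": (1280, 720), "1080P": (1920, 1080), "4K": (3840, 2160)}

# Cost of full attention at each resolution, relative to 720P.
base = full_attention_pairs(tokens_per_frame(*RESOLUTIONS["720P"]))
relative_cost = {}
for name, (w, h) in RESOLUTIONS.items():
    n = tokens_per_frame(w, h)
    relative_cost[name] = full_attention_pairs(n) / base
    print(f"{name}: {n} tokens/frame, full attention x{relative_cost[name]:.1f} vs 720P")
```

Because 4K is three times 720P along each axis, it has 9 times as many tokens per frame and full attention scores 81 times as many pairs. Restricting each token to a fixed-size local window, as in `windowed_attention_pairs`, makes the cost grow linearly with the token count instead, which is the general motivation for pairing a local branch with a cheaper, compressed global branch.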
