UltraGen: High-Resolution Video Generation with Hierarchical Attention
October 21, 2025
Authors: Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi
cs.AI
Abstract
Recent advances in video generation have made it possible to produce visually
compelling videos, with wide-ranging applications in content creation,
entertainment, and virtual reality. However, most existing diffusion-transformer-based
video generation models are limited to low-resolution outputs (<=720P) due to the
quadratic computational complexity of the attention
mechanism with respect to the output width and height. This computational
bottleneck makes native high-resolution video generation (1080P/2K/4K)
impractical for both training and inference. To address this challenge, we
present UltraGen, a novel video generation framework that enables i) efficient
and ii) end-to-end native high-resolution video synthesis. Specifically,
UltraGen features a hierarchical dual-branch attention architecture based on
global-local attention decomposition, which decouples full attention into a
local attention branch for high-fidelity regional content and a global
attention branch for overall semantic consistency. We further propose a
spatially compressed global modeling strategy to efficiently learn global
dependencies, and a hierarchical cross-window local attention mechanism to
reduce computational costs while enhancing information flow across different
local windows. Extensive experiments demonstrate that UltraGen can effectively
scale pre-trained low-resolution video models to 1080P and even 4K resolution
for the first time, outperforming existing state-of-the-art methods and
super-resolution-based two-stage pipelines in both qualitative and quantitative
evaluations.
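
The abstract describes the global-local attention decomposition only at a high level. Below is a minimal PyTorch sketch of that idea, assuming a single-frame token grid: the local branch runs full attention inside non-overlapping spatial windows (high-fidelity regional content), while the global branch lets every token attend to a spatially pooled, compressed token set (overall semantic consistency). The module name DualBranchAttention, the average-pooling operator, the window_size/pool_stride parameters, and the additive fusion are illustrative assumptions, not the paper's actual layer design, which additionally includes hierarchical cross-window interaction and temporal modeling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchAttention(nn.Module):
    """Illustrative global-local attention decomposition (not the official implementation)."""

    def __init__(self, dim, num_heads=8, window_size=8, pool_stride=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.window_size = window_size
        self.pool_stride = pool_stride
        self.qkv_local = nn.Linear(dim, dim * 3)
        self.qkv_global = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _heads(self, t):
        # (B, N, C) -> (B, num_heads, N, C // num_heads)
        B, N, C = t.shape
        return t.view(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

    def forward(self, x, H, W):
        # x: (B, H*W, C) spatial tokens of one frame; the temporal axis is omitted
        # for brevity. H and W are assumed divisible by window_size and pool_stride.
        B, N, C = x.shape
        ws = self.window_size

        # --- Local branch: full attention restricted to non-overlapping windows ---
        q, k, v = self.qkv_local(x).chunk(3, dim=-1)

        def to_windows(t):
            t = t.view(B, H // ws, ws, W // ws, ws, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        lq, lk, lv = map(to_windows, (q, k, v))
        local = F.scaled_dot_product_attention(self._heads(lq), self._heads(lk), self._heads(lv))
        local = local.transpose(1, 2).reshape(B, H // ws, W // ws, ws, ws, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # --- Global branch: every token attends to a spatially pooled token set ---
        s = self.pool_stride
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = F.avg_pool2d(grid, kernel_size=s, stride=s).flatten(2).transpose(1, 2)
        gq = self.qkv_global(x).chunk(3, dim=-1)[0]            # queries from full-resolution tokens
        _, gk, gv = self.qkv_global(pooled).chunk(3, dim=-1)   # keys/values from compressed tokens
        glob = F.scaled_dot_product_attention(self._heads(gq), self._heads(gk), self._heads(gv))
        glob = glob.transpose(1, 2).reshape(B, N, C)

        # Fuse high-fidelity local detail with globally consistent context.
        return self.proj(local + glob)


# Toy usage: a small token grid keeps the example quick while preserving the
# divisibility constraints assumed above.
if __name__ == "__main__":
    B, H, W, C = 1, 32, 32, 128
    layer = DualBranchAttention(dim=C, num_heads=8, window_size=8, pool_stride=4)
    tokens = torch.randn(B, H * W, C)
    out = layer(tokens, H, W)
    print(out.shape)  # torch.Size([1, 1024, 128])
```

The intended efficiency gain is that local attention costs scale with the window size rather than the full frame, and the global branch operates on a token set reduced by roughly pool_stride squared, which is what makes native 1080P-4K generation tractable compared with full attention over all spatial tokens.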