UltraGen: High-Resolution Video Generation with Hierarchical Attention

October 21, 2025
Authors: Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi
cs.AI

Abstract

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion-transformer-based video generation models are limited to low-resolution outputs (<=720P) because the computational cost of the attention mechanism grows quadratically with the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can, for the first time, effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution, outperforming existing state-of-the-art methods and super-resolution-based two-stage pipelines in both qualitative and quantitative evaluations.
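
To make the cost argument in the abstract concrete, the following is a rough back-of-the-envelope sketch in Python, not the authors' implementation. It counts query-key pairs for full attention, for a non-overlapping windowed local branch, and for a global branch run on a spatially downsampled token grid. The function names, the token-grid sizes, the window size, and the compression factor are all hypothetical values chosen for illustration; the abstract does not specify them, and the actual UltraGen branches may be structured differently.

```python
# Illustrative cost model only: every name and number here is an assumption,
# not taken from the UltraGen paper.

def full_attention_pairs(h_tokens: int, w_tokens: int, frames: int = 1) -> int:
    """Query-key pairs when every token attends to every other token."""
    n = h_tokens * w_tokens * frames
    return n * n


def local_window_pairs(h_tokens: int, w_tokens: int, window: int, frames: int = 1) -> int:
    """Pairs when tokens attend only inside non-overlapping window x window
    spatial tiles (each tile spans all frames). Assumed layout, for illustration."""
    if h_tokens % window or w_tokens % window:
        raise ValueError("grid must be divisible by the window size")
    num_windows = (h_tokens // window) * (w_tokens // window)
    tokens_per_window = window * window * frames
    return num_windows * tokens_per_window ** 2


def compressed_global_pairs(h_tokens: int, w_tokens: int, compress: int, frames: int = 1) -> int:
    """Pairs for full attention on a grid downsampled by `compress` per spatial
    axis. This is one assumed reading of 'spatially compressed' global modeling."""
    n = (h_tokens // compress) * (w_tokens // compress) * frames
    return n * n


if __name__ == "__main__":
    low = (80, 48)    # hypothetical low-resolution token grid
    high = (160, 96)  # same grid with width and height doubled

    # Full attention: doubling both sides quadruples tokens, so pairs grow 16x.
    print(full_attention_pairs(*high) // full_attention_pairs(*low))

    # Fixed-size local windows: pairs grow only with the number of windows (4x).
    print(local_window_pairs(*high, window=8) // local_window_pairs(*low, window=8))

    # Combined local + compressed-global cost relative to full attention.
    combined = local_window_pairs(*high, window=8) + compressed_global_pairs(*high, compress=4)
    print(combined / full_attention_pairs(*high))
```

Under these assumed settings, the pair count for full attention scales with the square of the token count, while the windowed branch scales linearly with it and the compressed global branch shrinks by the fourth power of the compression factor. This is the general reason a global-local decomposition can be cheaper than full attention at high resolution; the real savings depend on design details not given in the abstract.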