UltraGen: High-Resolution Video Generation with Hierarchical Attention

Abstract

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion-transformer-based video generation models are limited to low-resolution outputs (<=720P) because the computational cost of the attention mechanism grows quadratically with the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to learn global dependencies efficiently, and a hierarchical cross-window local attention mechanism to reduce computational cost while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can, for the first time, effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution, outperforming existing state-of-the-art methods and super-resolution-based two-stage pipelines in both qualitative and quantitative evaluations.
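
To make the quadratic bottleneck and the global-local decomposition concrete, the following is a rough counting sketch, not UltraGen's actual implementation. It only counts query-key pairs for three attention layouts. The token grid sizes (one token per 16x16 pixels), the window size, and the compression stride are all illustrative assumptions that do not come from the abstract, and the function names are hypothetical.

```python
def full_attention_pairs(frames, height, width):
    """Query-key pairs when every token attends to every token."""
    n = frames * height * width
    return n * n


def local_window_pairs(frames, height, width, window):
    """Pairs when tokens attend only within non-overlapping spatial windows.

    Assumption: each window spans all frames and covers window x window
    spatial positions. This is an illustrative layout, not the paper's.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = frames * window * window
    return num_windows * tokens_per_window ** 2


def compressed_global_pairs(frames, height, width, stride):
    """Pairs for full attention over a spatially downsampled token grid.

    Assumption: the grid is reduced by `stride` along both spatial axes.
    """
    n = frames * (height // stride) * (width // stride)
    return n * n


if __name__ == "__main__":
    # Hypothetical token grids: 1280x720 -> 80x45, 3840x2160 -> 240x135.
    frames = 8
    full_720p = full_attention_pairs(frames, 45, 80)
    full_4k = full_attention_pairs(frames, 135, 240)
    # 4K has 9x the tokens of 720P, so full attention costs 81x the pairs.
    print("4K / 720P full-attention ratio:", full_4k // full_720p)

    # Hypothetical window size 15 and compression stride 5 at 4K.
    local = local_window_pairs(frames, 135, 240, window=15)
    global_ = compressed_global_pairs(frames, 135, 240, stride=5)
    print("dual-branch / full ratio:", (local + global_) / full_4k)
```

Under these assumed settings, the combined pair count of the two branches is a small fraction of full attention at 4K, which is the intuition behind splitting attention into a local and a compressed global branch. Actual savings depend on the real window sizes, compression ratios, and implementation details.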
