
Native-Resolution Image Synthesis

June 3, 2025
作者: Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
cs.AI

Abstract

We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536x1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
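To make the idea of "natively handling variable-length visual tokens" concrete, the sketch below shows one common way such tokenization can work: an image (or latent) of arbitrary height, width, and aspect ratio is split into square patches, yielding a token sequence whose length varies with resolution, together with 2D grid coordinates that a transformer could use for positional encoding. This is a minimal illustrative sketch under those assumptions, not the paper's NiT implementation; the function name `patchify_native` and the example tensor shapes are hypothetical.

```python
# Minimal sketch: native-resolution patch tokenization (illustrative only,
# not the NiT architecture from the paper).
import torch


def patchify_native(x: torch.Tensor, patch: int = 2):
    """Turn a (C, H, W) tensor into a variable-length token sequence.

    Returns tokens of shape (N, C * patch * patch) and integer (row, col)
    grid coordinates of shape (N, 2), where N = (H // patch) * (W // patch).
    Because H and W may differ per image, N varies with resolution and
    aspect ratio instead of being fixed in advance.
    """
    c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "pad to a patch multiple first"
    gh, gw = h // patch, w // patch
    # (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (gh*gw, C*p*p)
    tokens = (
        x.reshape(c, gh, patch, gw, patch)
         .permute(1, 3, 0, 2, 4)
         .reshape(gh * gw, c * patch * patch)
    )
    rows = torch.arange(gh).repeat_interleave(gw)
    cols = torch.arange(gw).repeat(gh)
    coords = torch.stack([rows, cols], dim=1)
    return tokens, coords


if __name__ == "__main__":
    # Two inputs with different resolutions/aspect ratios produce different
    # sequence lengths; a transformer can consume them via padding or packing.
    square = torch.randn(4, 32, 32)   # e.g. a square latent
    wide = torch.randn(4, 16, 48)     # e.g. a 3:1 aspect-ratio latent
    for img in (square, wide):
        toks, pos = patchify_native(img)
        print(toks.shape, pos.shape)
```

In this framing, the denoising transformer sees each image as however many tokens its native resolution produces, rather than a fixed-length grid, which is what allows a single model to be trained and sampled across many resolutions and aspect ratios.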