Native-Resolution Image Synthesis

June 3, 2025
Authors: Zidong Wang, Lei Bai, Xiangyu Yue, Wanli Ouyang, Yiyuan Zhang
cs.AI

Abstract

We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the limitations of conventional fixed-resolution, square-image methods by natively handling variable-length visual tokens, a core challenge for traditional techniques. To this end, we introduce the Native-resolution diffusion Transformer (NiT), an architecture designed to explicitly model varying resolutions and aspect ratios within its denoising process. Free from the constraints of fixed formats, NiT learns intrinsic visual distributions from images spanning a broad range of resolutions and aspect ratios. Notably, a single NiT model simultaneously achieves state-of-the-art performance on both the ImageNet-256x256 and 512x512 benchmarks. Surprisingly, akin to the robust zero-shot capabilities seen in advanced large language models, NiT, trained solely on ImageNet, demonstrates excellent zero-shot generalization performance. It successfully generates high-fidelity images at previously unseen high resolutions (e.g., 1536 x 1536) and diverse aspect ratios (e.g., 16:9, 3:1, 4:3), as shown in Figure 1. These findings indicate the significant potential of native-resolution modeling as a bridge between visual generative modeling and advanced LLM methodologies.
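The abstract's key mechanism is that each image, whatever its resolution or aspect ratio, is turned into a variable-length sequence of visual tokens that the diffusion Transformer denoises directly. Below is a minimal sketch of that ingestion step, assuming a ViT-style patchify stage; the function name `patchify`, the patch size, and the (row, col) coordinate scheme are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch (assumption, not the NiT release): patchify images of arbitrary
# size into variable-length token sequences plus 2D patch coordinates, the kind
# of input a native-resolution diffusion Transformer could denoise directly.
import torch


def patchify(image: torch.Tensor, patch_size: int = 16):
    """Split a single image (C, H, W) into a variable-length token sequence.

    Returns:
        tokens: (N, C * patch_size**2) flattened patches, N = (H/p) * (W/p)
        coords: (N, 2) integer (row, col) indices of each patch, which a 2D
                positional embedding can consume so the model sees the true
                resolution and aspect ratio rather than a fixed square grid.
    """
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "pad to a multiple of patch_size"
    gh, gw = h // patch_size, w // patch_size
    # (C, gh, p, gw, p) -> (gh, gw, C, p, p) -> (gh*gw, C*p*p)
    patches = image.reshape(c, gh, patch_size, gw, patch_size)
    patches = patches.permute(1, 3, 0, 2, 4).reshape(gh * gw, -1)
    rows = torch.arange(gh).repeat_interleave(gw)
    cols = torch.arange(gw).repeat(gh)
    coords = torch.stack([rows, cols], dim=1)
    return patches, coords


# Two images with different resolutions / aspect ratios yield sequences of
# different lengths; a Transformer can batch them with padding and an
# attention mask, as in variable-length LLM training.
img_square = torch.randn(3, 256, 256)   # 16 x 16 = 256 tokens
img_wide = torch.randn(3, 256, 448)     # 16 x 28 = 448 tokens
for img in (img_square, img_wide):
    tokens, coords = patchify(img)
    print(tokens.shape, coords.shape)
```

The point of the sketch is only that sequence length varies with image size while the per-token dimensionality stays fixed, which is what lets a single model train on, and generalize across, many resolutions and aspect ratios.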
