Generative Refinement Networks for Visual Synthesis
April 14, 2026
Authors: Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan
cs.AI
Abstract
While diffusion models dominate the field of visual generation, they are computationally inefficient, applying uniform computational effort regardless of content complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm that addresses these issues. At its core, GRN removes the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving reconstruction quality comparable to that of continuous representations. Built on HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively refines and corrects its output, much as a human artist revises a painting. In addition, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN sets new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to the more challenging text-to-image and text-to-video tasks, where it delivers superior performance at a comparable model scale. We release all models and code to foster further research on GRN.
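The abstract does not specify how HBQ works, so as an illustration only, here is a generic sketch of hierarchical (residual) binary quantization: each level stores one sign vector plus one scalar scale, and the residual norm provably shrinks at every level, which is why such hierarchies can approach lossless reconstruction as depth grows. The function name, the per-level scale rule, and all other details are our own assumptions, not the paper's actual HBQ algorithm.

```python
import numpy as np

def hierarchical_binary_quantize(z, levels=8):
    """Toy residual binary quantization (an assumed sketch, not the paper's HBQ).

    At each level we encode the sign of the current residual, pick the scalar
    scale s = mean(|residual|) that minimizes the L2 error for sign codes,
    accumulate s * sign(residual) into the reconstruction, and subtract it
    from the residual. Each level strictly reduces the residual norm.
    """
    residual = z.astype(np.float64)
    recon = np.zeros_like(residual)
    codes, scales = [], []
    for _ in range(levels):
        b = np.sign(residual)          # binary code per dimension, in {-1, 0, +1}
        s = np.mean(np.abs(residual))  # least-squares scale for a sign code
        codes.append(b)
        scales.append(s)
        recon += s * b
        residual = residual - s * b
    return codes, scales, recon

rng = np.random.default_rng(0)
z = rng.standard_normal(16)
codes, scales, z_hat = hierarchical_binary_quantize(z, levels=12)
err = np.linalg.norm(z - z_hat) / np.linalg.norm(z)
print(f"relative reconstruction error after 12 binary levels: {err:.4f}")
```

Each level costs only one bit per dimension plus one shared scale, and deeper hierarchies drive the reconstruction error toward zero, which is consistent with the "theoretically near-lossless" claim for a hierarchy of binary codes.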