DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Abstract

This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow lets generation start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, giving the AR model a more natural and efficient way to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches such as VAR and VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, owing to the significantly reduced token count and the parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
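The idea of supervising with progressively degraded images can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the degradation is performed, so the block-averaging scheme, the function names `degrade` and `coarse_to_fine_targets`, and the factor schedule `(8, 4, 2, 1)` are all assumptions chosen for illustration.

```python
import numpy as np


def degrade(image: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, C) image by `factor`, then upsample back by repetition.

    Hypothetical degradation operator; the paper's actual operator may differ.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("factor must divide both height and width")
    # Group pixels into factor x factor blocks and average each block.
    pooled = image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    # Repeat each pooled pixel so the output has the original spatial size.
    return pooled.repeat(factor, axis=0).repeat(factor, axis=1)


def coarse_to_fine_targets(image: np.ndarray, factors=(8, 4, 2, 1)) -> list:
    """Return targets ordered from most degraded to the original image.

    The factor schedule is an assumed example, not taken from the paper.
    """
    return [degrade(image, f) for f in factors]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((16, 16, 3))
    for f, target in zip((8, 4, 2, 1), coarse_to_fine_targets(img)):
        print(f"factor={f} shape={target.shape} variance={target.var():.4f}")
```

In a setup like this, a short token prefix would be trained to reconstruct the most degraded target and longer prefixes the sharper ones, which is one way to read "resolution-aware token sequence"; how prefix lengths map to resolutions is left open here because the abstract does not specify it.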
