DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Abstract

This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow lets generation start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, giving the AR model a more natural and efficient way to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches such as VAR and VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), both of which require 680 tokens in their AR models. Moreover, owing to the significantly reduced token count and the parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
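To make the idea of supervising token prefixes with progressively degraded images concrete, here is a minimal sketch. It is not the authors' implementation. The linear prefix-to-resolution schedule, the average-pooling degradation, and the function names `resolution_for_prefix`, `degrade`, and `build_targets` are all assumptions made for illustration; the abstract only states that shorter prefixes correspond to coarser images.

```python
import numpy as np


def resolution_for_prefix(num_tokens, max_tokens=128, max_res=256):
    """Hypothetical schedule: target resolution grows linearly with prefix length."""
    return max_res * num_tokens // max_tokens


def degrade(image, res):
    """Average-pool a square (H, W, C) image to res x res, then upsample it back
    to H x W by pixel repetition. This is an assumed degradation, chosen for simplicity."""
    h, w, c = image.shape
    if h != w or h % res != 0:
        raise ValueError("expected a square image whose side is a multiple of res")
    f = h // res  # pooling factor
    pooled = image.reshape(res, f, res, f, c).mean(axis=(1, 3))
    return pooled.repeat(f, axis=0).repeat(f, axis=1)


def build_targets(image, prefix_lengths, max_tokens=128):
    """Return one degraded reconstruction target per token-prefix length."""
    max_res = image.shape[0]
    return {
        k: degrade(image, resolution_for_prefix(k, max_tokens, max_res))
        for k in prefix_lengths
    }


if __name__ == "__main__":
    # Random stand-in for a 256x256 RGB image.
    img = np.random.default_rng(0).random((256, 256, 3))
    targets = build_targets(img, [8, 16, 32, 64, 128])
    for k, target in targets.items():
        mse = float(np.mean((target - img) ** 2))
        print(f"{k:3d} tokens -> {resolution_for_prefix(k):3d} px target, MSE vs. original {mse:.4f}")
```

Under this assumed schedule, the first 8 tokens are trained against a 16x16-equivalent image and the full 128 tokens against the original, so each additional token only has to account for detail the shorter prefix could not represent.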
