DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Abstract

This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow lets the generation process start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, giving the AR model a more natural and efficient way to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches such as VAR and VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, owing to the significantly reduced token count and the parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
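To make the token-budget comparison concrete, the short sketch below reproduces the 680-token figure quoted for VAR and compares it with the 128 tokens used here. The scale schedule `VAR_SCALE_SIDES` is an assumption: it is the side-length schedule commonly reported for VAR's default 256x256 configuration and is not stated in this abstract, which gives only the totals. The helper name `multiscale_token_count` is hypothetical.

```python
# Assumed side lengths of the token maps in VAR's default 256x256 setup.
# The abstract only gives the 680-token total; this schedule is an assumption.
VAR_SCALE_SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def multiscale_token_count(sides):
    """Total tokens when each scale contributes a side x side token map."""
    return sum(side * side for side in sides)


var_tokens = multiscale_token_count(VAR_SCALE_SIDES)
detailflow_tokens = 128  # token count stated in the abstract
ratio = var_tokens / detailflow_tokens

print(f"VAR tokens: {var_tokens}")
print(f"DetailFlow tokens: {detailflow_tokens}")
print(f"Reduction factor: {ratio:.2f}x")
```

Under this assumed schedule the sum of squared side lengths is 680, so the 128-token sequence is roughly 5.3 times shorter. The reported inference speedup of nearly 2x also depends on the parallel inference mechanism, not on sequence length alone.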
