End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

Abstract

Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, so that generation results directly supervise the tokenizer. This contrasts with prior two-stage approaches that train the tokenizer and the generative model separately. We further investigate how vision foundation models can be used to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results on ImageNet 256×256 generation, including a state-of-the-art FID score of 1.48 without guidance.
