
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

October 16, 2024
作者: Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing
cs.AI

Abstract

Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs), have achieved notable success in image generation tasks. These models typically leverage reconstructive autoencoders like VQGAN or VAE to encode pixels into a more compact latent space and learn the data distribution in the latent space instead of directly from pixels. However, this practice raises a pertinent question: Is it truly the optimal choice? In response, we begin with an intriguing observation: despite sharing the same latent space, autoregressive models significantly lag behind LDMs and MIMs in image generation. This finding contrasts sharply with the field of NLP, where the autoregressive model GPT has established a commanding presence. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of latent space in image generative modeling. Furthermore, we propose a simple but effective discrete image tokenizer to stabilize the latent space for image generative modeling. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation with the next token prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, which also exhibits substantial improvement akin to GPT when scaling up model size. Our findings underscore the potential of an optimized latent space and the integration of discrete tokenization in advancing the capabilities of image generative models. The code is available at https://github.com/DAMO-NLP-SG/DiGIT.

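To make the next-token-prediction principle for images concrete, below is a minimal, illustrative sketch, not the paper's DiGIT implementation. It assumes a discrete image tokenizer that maps an image to a 1D sequence of codebook indices (faked here with random integers) and trains a toy GPT-style decoder to predict each token from the preceding ones. All names, vocabulary size, sequence length, and model dimensions are placeholders.

```python
# Hedged sketch of GPT-style next-token prediction over discrete image tokens.
# NOT the authors' DiGIT code; tokenizer output is simulated with random ints.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 8192   # codebook size of the discrete tokenizer (placeholder)
SEQ_LEN = 256       # e.g. a 16x16 grid of image tokens (placeholder)
D_MODEL = 512

class TinyImageGPT(nn.Module):
    """Toy decoder-only Transformer that predicts the next image token."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=8, dim_feedforward=2048, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask: each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)

model = TinyImageGPT()
# Stand-in for tokenizer output: a batch of discrete image-token sequences.
tokens = torch.randint(0, VOCAB_SIZE, (4, SEQ_LEN))
logits = model(tokens[:, :-1])                       # predict token i+1 from tokens <= i
loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

The sketch only shows the training objective (cross-entropy on shifted token sequences); the paper's contribution lies in how the discrete tokenizer is built so that this latent token space is stable enough for such autoregressive modeling to work well.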