MaskBit: Embedding-free Image Generation via Bit Tokens
September 24, 2024
Authors: Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
cs.AI
Abstract
Masked transformer models for class-conditional image generation have become
a compelling alternative to diffusion models. Typically comprising two stages -
an initial VQGAN model for transitioning between latent space and image space,
and a subsequent Transformer model for image generation within latent space -
these frameworks offer promising avenues for image synthesis. In this study, we
present two primary contributions: Firstly, an empirical and systematic
examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel
embedding-free generation network operating directly on bit tokens - a binary
quantized representation of tokens with rich semantics. The first contribution
furnishes a transparent, reproducible, and high-performing VQGAN model,
enhancing accessibility and matching the performance of current
state-of-the-art methods while revealing previously undisclosed details. The
second contribution demonstrates that embedding-free image generation using bit
tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256
benchmark, with a compact generator model of a mere 305M parameters.
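
The abstract describes bit tokens as a binary quantized representation obtained without an embedding lookup. The following is a minimal, hypothetical sketch of sign-based binary quantization in that spirit; the function names, tensor shapes, and the 12-bit setting are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def to_bit_tokens(latents: torch.Tensor) -> torch.Tensor:
    """Quantize latents of shape (B, K, H, W) to {0, 1} bits, one bit per channel."""
    return (latents > 0).to(torch.int64)

def bits_to_indices(bits: torch.Tensor) -> torch.Tensor:
    """Pack the K bits at each spatial position into a single integer token id."""
    K = bits.shape[1]
    weights = 2 ** torch.arange(K, device=bits.device).view(1, K, 1, 1)
    return (bits * weights).sum(dim=1)  # shape (B, H, W)

# Example (assumed sizes): a 12-channel latent grid yields 12-bit tokens,
# i.e. an implicit vocabulary of 2^12 = 4096 without any learned codebook embedding.
latents = torch.randn(2, 12, 16, 16)
bits = to_bit_tokens(latents)
ids = bits_to_indices(bits)
print(bits.shape, ids.shape, int(ids.max()) < 2 ** 12)
```

In such a scheme the generator can operate directly on the bit representation, since each token is already a low-dimensional binary vector rather than an index that must be mapped through an embedding table.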