MaskBit: Embedding-free Image Generation via Bit Tokens

Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. These frameworks typically comprise two stages: an initial VQGAN model that maps between latent space and image space, and a subsequent Transformer model that generates images within the latent space. Together they offer a promising avenue for image synthesis. In this study, we present two primary contributions. First, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Second, a novel embedding-free generation network operating directly on bit tokens, a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256 benchmark with a compact generator model of only 305M parameters.
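The abstract does not spell out how bit tokens are computed. As a rough illustration only, the sketch below assumes a sign-based binarization of each latent channel, which is one common way to obtain a binary quantized representation; it is not confirmed by the abstract as the authors' procedure. The function names `to_bit_token`, `bits_to_index`, and `bits_to_input` are hypothetical.

```python
from typing import List


def to_bit_token(latent: List[float]) -> List[int]:
    """Binarize each latent channel by sign (assumed rule): >= 0 -> 1, < 0 -> 0."""
    return [1 if x >= 0.0 else 0 for x in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack bits (most significant first) into the integer index
    that a codebook-based generator would look up in an embedding table."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def bits_to_input(bits: List[int]) -> List[float]:
    """Map bits {0, 1} to {-1.0, +1.0} so the bit vector itself can serve
    as the generator input, with no embedding lookup (the 'embedding-free' idea)."""
    return [2.0 * b - 1.0 for b in bits]


# Hypothetical 4-channel latent vector for one spatial position.
latent = [0.7, -1.2, 0.1, -0.3]
bits = to_bit_token(latent)
print(bits, bits_to_index(bits), bits_to_input(bits))
```

In this sketch, a conventional pipeline would pass the packed index to a learned embedding table, whereas an embedding-free generator would consume the signed bit vector directly.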

