Finite Scalar Quantization: VQ-VAE Made Simple

Abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the number of values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
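The quantization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The level counts in `LEVELS`, the use of `tanh` as the bounding function, and the helper names `fsq_quantize` and `codes_to_index` are assumptions made for illustration. Only odd level counts are used so that rounding stays symmetric around zero, and the sketch covers the forward pass only, so it says nothing about how gradients are passed through the rounding during training.

```python
import numpy as np

# Hypothetical level counts, one per latent dimension (odd values only, an
# assumption of this sketch). Implicit codebook size = 7 * 5 * 5 * 5 = 875.
LEVELS = np.array([7, 5, 5, 5])


def fsq_quantize(z, levels=LEVELS):
    """Bound each dimension, then round it to one of levels[i] integers."""
    half = (levels - 1) / 2        # 5 levels -> integers in {-2, ..., 2}
    bounded = np.tanh(z) * half    # assumed bounding function: squash into (-half, half)
    return np.round(bounded).astype(np.int64)


def codes_to_index(codes, levels=LEVELS):
    """Map a quantized vector to its position in the implicit product codebook."""
    half = (levels - 1) // 2
    digits = codes + half          # shift each dimension to {0, ..., levels[i] - 1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # mixed-radix weights
    return (digits * basis).sum(axis=-1)


# Random vectors standing in for the projected encoder outputs.
z = np.random.default_rng(0).normal(size=(4, len(LEVELS)))
codes = fsq_quantize(z)            # integer grid points, shape (4, 4)
indices = codes_to_index(codes)    # one token id per vector, in [0, 875)
```

Because the codebook is just the grid of all per-dimension value combinations, there are no codebook vectors to learn or look up; a token id is obtained by reading the quantized vector as a mixed-radix number.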
