Finite Scalar Quantization: VQ-VAE Made Simple

September 27, 2023
Authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen
cs.AI

Abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
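
The scheme is simple enough to sketch in a few lines. Below is a minimal NumPy illustration of the quantizer's forward pass and of indexing into the implicit product codebook. The level counts LEVELS = [7, 5, 5, 5] (giving 7 * 5 * 5 * 5 = 875 codes), the function names, and the tanh-based bounding are illustrative assumptions loosely following the abstract, not the authors' implementation; in particular, training additionally requires a straight-through estimator around the rounding, and even level counts need a half-integer offset.

```python
import numpy as np

# Hypothetical per-channel level counts: 7 * 5 * 5 * 5 = 875 codes,
# comparable in size to a small VQ codebook.
LEVELS = np.array([7, 5, 5, 5])

def fsq_quantize(z, levels=LEVELS):
    """Quantize each channel of z (shape (..., d)) to levels[i] fixed values.

    Odd level counts keep the integer grid symmetric around 0; even counts
    would need an extra half-integer offset. During training, round() would
    be wrapped in a straight-through estimator so gradients pass through
    the quantizer unchanged.
    """
    half = (levels - 1) / 2.0
    z_bounded = half * np.tanh(z)   # bound channel i to (-half_i, half_i)
    return np.round(z_bounded)      # snap to the nearest integer grid point

def codes_to_indices(z_q, levels=LEVELS):
    """Map quantized vectors to ids in the implicit product codebook."""
    digits = z_q + (levels - 1) / 2.0                        # shift to {0, ..., L_i - 1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))   # mixed-radix weights
    return (digits * basis).sum(axis=-1).astype(int)         # ids in {0, ..., 874}

# A batch of two 4-dimensional latents, as a VAE encoder might emit them.
z = np.random.randn(2, 4)
z_q = fsq_quantize(z)
print(z_q, codes_to_indices(z_q))
```

Because the codebook is just the fixed product grid, there is nothing to learn or collapse: every index corresponds to a valid code by construction, which is the source of FSQ's simplicity relative to VQ.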
