Finite Scalar Quantization: VQ-VAE Made Simple

September 27, 2023
Authors: Fabian Mentzer, David Minnen, Eirikur Agustsson, Michael Tschannen
cs.AI

Abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance on all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
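To make the scheme concrete, the following is a minimal sketch of FSQ in JAX, not the authors' reference implementation: each channel of the projected representation is bounded with tanh, rounded to one of a fixed set of values using a straight-through gradient estimator, and the resulting tuple is mapped to an index in the implicit product codebook. The function names and the (8, 5, 5, 5) level configuration are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def round_ste(z):
    """Round with a straight-through estimator: the forward pass
    rounds, the backward pass is the identity."""
    return z + jax.lax.stop_gradient(jnp.round(z) - z)

def fsq_quantize(z, levels):
    """Quantize each channel of z to a fixed number of levels.

    z:      array of shape (..., d), the low-dimensional projection.
    levels: per-channel level counts, e.g. (8, 5, 5, 5) for an
            implicit codebook of 8*5*5*5 = 1000 codes.
    """
    levels = jnp.asarray(levels)
    half = (levels - 1) / 2.0
    # Half-integer offset so that even level counts also yield
    # exactly `levels` distinct rounded values per channel.
    offset = jnp.where(levels % 2 == 0, 0.5, 0.0)
    bounded = jnp.tanh(z) * half  # squash channel i into (-half_i, half_i)
    return round_ste(bounded + offset) - offset

def codes_to_indices(zhat, levels):
    """Map a quantized vector to its index in the implicit codebook
    via mixed-radix encoding over the per-channel level sets."""
    levels = jnp.asarray(levels)
    digits = jnp.round(zhat + (levels - 1) / 2.0).astype(jnp.int32)
    basis = jnp.concatenate([jnp.ones(1, jnp.int32),
                             jnp.cumprod(levels[:-1]).astype(jnp.int32)])
    return jnp.sum(digits * basis, axis=-1)

# Example: quantize a 4-dimensional projection.
z = jax.random.normal(jax.random.PRNGKey(0), (4,))
levels = (8, 5, 5, 5)
zhat = fsq_quantize(z, levels)        # one of 1000 implicit codes
idx = codes_to_indices(zhat, levels)  # integer token in [0, 1000)
```

Because rounding is replaced by the identity in the backward pass, the encoder trains end to end without a commitment loss, and every code in the product set is reachable by construction, which is consistent with the abstract's claim that no reseeding or code-splitting machinery is needed.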