Finite Scalar Quantization: VQ-VAE Made Simple

Abstract

We propose to replace vector quantization (VQ) in the latent representation of VQ-VAEs with a simple scheme termed finite scalar quantization (FSQ), where we project the VAE representation down to a few dimensions (typically fewer than 10). Each dimension is quantized to a small set of fixed values, leading to an (implicit) codebook given by the product of these sets. By appropriately choosing the number of dimensions and the values each dimension can take, we obtain the same codebook size as in VQ. On top of such discrete representations, we can train the same models that have been trained on VQ-VAE representations, for example autoregressive and masked transformer models for image generation, multimodal generation, and dense prediction computer vision tasks. Concretely, we employ FSQ with MaskGIT for image generation, and with UViM for depth estimation, colorization, and panoptic segmentation. Despite the much simpler design of FSQ, we obtain competitive performance in all these tasks. We emphasize that FSQ does not suffer from codebook collapse and does not need the complex machinery employed in VQ (commitment losses, codebook reseeding, code splitting, entropy penalties, etc.) to learn expressive discrete representations.
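To make the idea of an implicit product codebook concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fsq_quantize`, `codes_to_index`, `codebook_size`), the use of `tanh` to bound each channel, the mapping onto the integers 0 to L-1, and the level choice `[8, 5, 5, 5]` are all hypothetical. The sketch covers only the forward quantization step; a trainable version would also need some way to pass gradients through the rounding, which is omitted here.

```python
import math
from typing import List, Sequence


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Quantize each channel of z to an integer code in {0, ..., L_i - 1}.

    Assumption: channels are bounded with tanh and then rounded to the
    nearest of L_i evenly spaced values. This is one simple choice, not
    necessarily the scheme used in the paper.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    codes = []
    for value, num_levels in zip(z, levels):
        bounded = (math.tanh(value) + 1.0) / 2.0  # squash into [0, 1]
        codes.append(round(bounded * (num_levels - 1)))  # nearest level
    return codes


def codes_to_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Flatten per-channel codes into one token index (mixed-radix number)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def codebook_size(levels: Sequence[int]) -> int:
    """The implicit codebook is the product of the per-channel level counts."""
    return math.prod(levels)


if __name__ == "__main__":
    levels = [8, 5, 5, 5]  # hypothetical choice: 8 * 5 * 5 * 5 = 1000 codes
    z = [0.3, -1.2, 2.0, 0.0]  # a 4-dimensional projected latent
    codes = fsq_quantize(z, levels)
    print(codes, codes_to_index(codes, levels), codebook_size(levels))
```

Because every combination of per-channel codes is a valid codebook entry, nothing has to be learned or stored for the codebook itself. The flattened index can be used as a token in the same way a VQ codebook index would be.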
