Channel-wise Vector Quantization

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization (VQ), which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual detail rather than as a grid of spatial patches. Building on CVQ, we introduce a new visual autoregressive framework based on "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively richer visual detail: it first sketches the global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that (1) CVQ achieves 100% codebook utilization with a codebook of 16K+ entries without auxiliary tricks, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
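
To make the contrast between patch-wise and channel-wise tokens concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature map of shape (C, H, W), a nearest-neighbour lookup under squared L2 distance, and that channel-wise quantization treats each flattened H×W channel as one vector to be matched against a codebook. The function names, the codebook shapes, and the random data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Index of the nearest codebook entry (squared L2) for each row of `vectors`."""
    # Broadcast (N, 1, D) against (1, K, D) to get an (N, K) distance matrix.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def patch_wise_vq(feat, codebook):
    """Conventional VQ: one token per spatial position.

    feat: (C, H, W); codebook: (K, C). Returns an (H, W) grid of token ids.
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T  # (H*W, C): one C-dim vector per patch
    return nearest_code(vectors, codebook).reshape(H, W)


def channel_wise_vq(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel.

    feat: (C, H, W); codebook: (K, H*W). Returns C token ids.
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)  # (C, H*W): one vector per channel
    return nearest_code(vectors, codebook)


def codebook_utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    return np.unique(tokens).size / codebook_size


# Illustrative random data; sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32
feat = rng.normal(size=(C, H, W))

patch_tokens = patch_wise_vq(feat, rng.normal(size=(K, C)))
channel_tokens = channel_wise_vq(feat, rng.normal(size=(K, H * W)))

print(patch_tokens.shape)    # (4, 4): a spatial grid of tokens
print(channel_tokens.shape)  # (8,): one token per channel
```

Under these assumptions, the patch-wise tokenizer yields a sequence whose length scales with spatial resolution, whereas the channel-wise tokenizer yields a sequence whose length equals the number of channels. That channel-ordered sequence is the kind of sequence a next-channel autoregressive model would predict one element at a time.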