Channel-wise Vector Quantization

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual detail rather than as a grid of spatial patches. Building on CVQ, we introduce a new visual autoregressive framework based on "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively richer visual detail. Specifically, it first sketches the global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization at codebook sizes of 16K and above without additional tricks, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
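To make the contrast between patch-wise and channel-wise tokens concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature map of shape (C, H, W), nearest-neighbour lookup under squared Euclidean distance, and, for the channel-wise case, that each channel is flattened into a single H*W-dimensional vector before lookup. The abstract does not specify these details; the function names, shapes, and random codebooks are all hypothetical.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Index of the nearest codebook row (squared L2) for each row of `vectors`."""
    # (N, 1, D) - (1, K, D) -> (N, K) matrix of squared distances
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def patchwise_vq(feat, codebook):
    """Conventional VQ: one token per spatial position.

    feat: (C, H, W), codebook: (K, C) -> tokens: (H, W)
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T  # (H*W, C): one C-dim vector per position
    return nearest_code(vectors, codebook).reshape(H, W)


def channelwise_vq(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel.

    feat: (C, H, W), codebook: (K, H*W) -> tokens: (C,)
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)  # (C, H*W): one flattened map per channel
    return nearest_code(vectors, codebook)


# Toy sizes and random codebooks, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32
feat = rng.normal(size=(C, H, W))

patch_tokens = patchwise_vq(feat, rng.normal(size=(K, C)))
channel_tokens = channelwise_vq(feat, rng.normal(size=(K, H * W)))

print(patch_tokens.shape)    # (4, 4): a spatial grid of tokens
print(channel_tokens.shape)  # (8,): a sequence of C channel tokens
```

Under these assumptions, the patch-wise tokenizer yields an H x W grid that an autoregressive model would traverse in raster order, whereas the channel-wise tokenizer yields a length-C sequence that a CAR-style model could predict one channel at a time.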