
Channel-wise Vector Quantization

May 25, 2026
Authors: Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li, Jiaqi Wang, Kaicheng Yu
cs.AI

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
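
To make the contrast between patch-wise and channel-wise tokens concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how a channel is matched to a code, so the nearest-neighbour lookup over flattened channels, the random codebooks, the tensor sizes, and all function names (`nearest_code`, `patchwise_vq`, `channelwise_vq`, `codebook_utilization`) are assumptions made for illustration only.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """For each row of `vectors`, return the index of the closest codebook row (L2)."""
    # Squared distances via ||v||^2 - 2 v.c + ||c||^2, giving shape (N, K).
    dists = (
        (vectors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * vectors @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)


def patchwise_vq(feat, codebook):
    """Conventional VQ: one token per spatial position.

    feat: (C, H, W) feature map; codebook: (K, C).
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T           # (H*W, C): one vector per patch
    return nearest_code(vectors, codebook).reshape(H, W)


def channelwise_vq(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel.

    feat: (C, H, W) feature map; codebook: (K, H*W).
    """
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)             # (C, H*W): one vector per channel
    return nearest_code(vectors, codebook)       # (C,)


def codebook_utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in `tokens`."""
    return np.unique(tokens).size / codebook_size


rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32                         # toy sizes, chosen arbitrarily
feat = rng.normal(size=(C, H, W))

patch_tokens = patchwise_vq(feat, rng.normal(size=(K, C)))
channel_tokens = channelwise_vq(feat, rng.normal(size=(K, H * W)))

print(patch_tokens.shape)    # (4, 4): a spatial grid of tokens
print(channel_tokens.shape)  # (8,):   a sequence of tokens, one per channel
```

Under these assumptions, the patch-wise tokenizer yields an H×W grid that an autoregressive model would traverse in raster order, while the channel-wise tokenizer yields a length-C sequence, which is the kind of ordering that "next-channel prediction" operates over.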