Channel-wise Vector Quantization

May 25, 2026
Authors: Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Ming Li, Jiaqi Wang, Kaicheng Yu
cs.AI

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
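
The contrast between patch-wise and channel-wise tokens can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the encoder, the codebook layout, or the distance metric, so the sketch assumes a feature map of shape (C, H, W), a plain nearest-neighbour lookup under squared Euclidean distance, and hypothetical helper names (`nearest_code`, `patch_wise_tokens`, `channel_wise_tokens`, `codebook_utilization`). The utilization function uses the common definition of the metric, the fraction of codebook entries selected at least once, which may differ in detail from how the paper measures it.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Return the index of the closest codebook entry for each row.

    vectors: (N, D) array, codebook: (K, D) array -> (N,) integer indices.
    Squared Euclidean distance is an assumption made for this sketch.
    """
    diffs = vectors[:, None, :] - codebook[None, :, :]  # (N, K, D)
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)


def patch_wise_tokens(feat, codebook):
    """Conventional VQ: one token per spatial position (H*W tokens)."""
    c, h, w = feat.shape
    vectors = feat.reshape(c, h * w).T  # (H*W, C): one C-dim vector per position
    return nearest_code(vectors, codebook)


def channel_wise_tokens(feat, codebook):
    """Channel-wise VQ as assumed here: one token per channel (C tokens)."""
    c, h, w = feat.shape
    vectors = feat.reshape(c, h * w)  # (C, H*W): one flattened map per channel
    return nearest_code(vectors, codebook)


def codebook_utilization(tokens, codebook_size):
    """Fraction of codebook entries that appear at least once in tokens."""
    return np.unique(tokens).size / codebook_size


# Toy sizes chosen only for illustration (hypothetical, not from the paper).
rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))            # C=8, H=W=4
patch_codebook = rng.normal(size=(32, 8))     # entries live in R^C
channel_codebook = rng.normal(size=(32, 16))  # entries live in R^(H*W)

patch_tokens = patch_wise_tokens(feat, patch_codebook)        # shape (16,)
channel_tokens = channel_wise_tokens(feat, channel_codebook)  # shape (8,)
```

Under these assumptions the token count follows the channel dimension rather than the spatial grid, so a "next-channel" autoregressive model would emit `channel_tokens` one entry at a time, each entry describing the whole image at one level of detail.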