Channel-wise Vector Quantization

Abstract

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual detail rather than as a grid of spatial patches. Building on CVQ, we introduce a new visual autoregressive framework based on "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively richer visual detail. Specifically, it first sketches the global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a codebook of more than 16K entries without any auxiliary techniques, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating its effectiveness for text-to-image generation.
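
To make the contrast between patch-wise and channel-wise tokens concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a feature map of shape (C, H, W), a plain nearest-neighbor lookup under squared L2 distance, and randomly initialized codebooks. The function names, the shapes, and the choice of one codebook entry per whole flattened channel are illustrative assumptions that the abstract does not specify.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Return, for each row of `vectors`, the index of the closest codebook row."""
    # vectors: (N, D), codebook: (K, D) -> squared L2 distances: (N, K)
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def patchwise_tokens(feat, codebook):
    """Conventional VQ: one token per spatial position (vector length C)."""
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W).T  # (H*W, C)
    return nearest_code(vectors, codebook).reshape(H, W)


def channelwise_tokens(feat, codebook):
    """Channel-wise VQ (assumed form): one token per channel (vector length H*W)."""
    C, H, W = feat.shape
    vectors = feat.reshape(C, H * W)  # (C, H*W)
    return nearest_code(vectors, codebook)


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 32
feat = rng.normal(size=(C, H, W))
patch_codebook = rng.normal(size=(K, C))        # entries live in channel space
channel_codebook = rng.normal(size=(K, H * W))  # entries live in spatial space

patch_ids = patchwise_tokens(feat, patch_codebook)        # grid of H x W tokens
channel_ids = channelwise_tokens(feat, channel_codebook)  # sequence of C tokens
print(patch_ids.shape, channel_ids.shape)
```

Under these assumptions, the patch-wise tokenizer yields an H x W grid of indices, while the channel-wise tokenizer yields a sequence of C indices, one per channel. That sequence is the kind of object a next-channel autoregressive model would predict one element at a time.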