CODA: Repurposing Continuous VAEs for Discrete Tokenization

Abstract

Discrete visual tokenizers transform images into sequences of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, because it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, which often leads to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs (variational autoencoders), which are already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By focusing primarily on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with 6× less training budget than a standard VQGAN, our approach achieves 100% codebook utilization and a reconstruction FID (rFID) of 0.43 and 1.34 at 8× and 16× compression, respectively, on the ImageNet 256×256 benchmark.
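To make the terms in the abstract concrete, the following is a minimal sketch of generic nearest-neighbor vector quantization over latent vectors, together with the codebook-utilization metric. It is not CODA's discretization process, which the abstract does not specify. The function names, array shapes, and the reading of "8× compression" as a per-side spatial downsampling factor are all assumptions made for illustration.

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry (squared L2).

    latents:  (N, D) array, e.g. flattened latents from a frozen continuous VAE
              (hypothetical input; the abstract does not give shapes).
    codebook: (K, D) array of code vectors.
    Returns an (N,) array of integer code indices.
    """
    # ||z - c||^2 = ||z||^2 - 2 z.c + ||c||^2, computed for all pairs at once
    d2 = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)

def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries that were selected at least once."""
    return np.unique(indices).size / codebook_size

def tokens_per_image(image_size, downsample):
    """Token count for a square image, ASSUMING 'compression' means the
    per-side spatial downsampling factor."""
    side = image_size // downsample
    return side * side

# Toy example: 4 codes in 4 dimensions, 3 latent vectors.
codebook = np.eye(4)
latents = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [1.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.1],
])
indices = quantize(latents, codebook)
utilization = codebook_utilization(indices, codebook_size=len(codebook))
```

In this toy case only two of the four codes are ever selected, so utilization is 0.5; the 100% figure reported in the abstract means every code is used. Under the downsampling assumption, a 256×256 image maps to 1024 tokens at 8× and 256 tokens at 16×.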
