CODA: Repurposing Continuous VAEs for Discrete Tokenization

March 22, 2025
Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
cs.AI

Abstract

Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By focusing primarily on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with a 6× smaller training budget than standard VQGAN, our approach achieves 100% codebook utilization and reconstruction FID (rFID) scores of 0.43 and 1.34 at 8× and 16× compression on the ImageNet 256×256 benchmark.
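To make the decoupling concrete, below is a minimal PyTorch sketch of the general idea: keep a pretrained continuous VAE frozen so compression is inherited, and train only a discretization module on its latents. The `vae_encode`/`vae_decode` handles, the codebook size, and the plain nearest-neighbour vector quantizer are all illustrative assumptions; the abstract does not specify CODA's actual "carefully designed discretization process", so this sketch substitutes a standard VQ lookup in its place.

```python
# Sketch only: freeze a pretrained continuous VAE, learn just a quantizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQuantizer(nn.Module):
    """Discretizes frozen continuous VAE latents against a learned codebook."""

    def __init__(self, codebook_size: int = 16384, latent_dim: int = 16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def forward(self, z: torch.Tensor):
        # z: (B, C, H, W) continuous latents from the frozen encoder.
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        # Nearest-neighbour code assignment.
        indices = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(indices).view(b, h, w, c).permute(0, 3, 1, 2)
        # Codebook loss: pull the selected codes toward the fixed latents.
        codebook_loss = F.mse_loss(z_q, z.detach())
        # Straight-through pass-through so z_q can be fed to the decoder.
        z_q_ste = z + (z_q - z).detach()
        return z_q_ste, indices.view(b, h, w), codebook_loss

def train_step(vae_encode, vae_decode, quantizer, optimizer, images):
    # Compression stays frozen: only the discretization module is optimized.
    with torch.no_grad():
        z = vae_encode(images)            # frozen perceptual compression
    z_q, _, loss = quantizer(z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        recon = vae_decode(z_q)           # frozen decoder, for monitoring rFID
    return loss.item(), recon
```

Because both encoder and decoder are frozen, the only trainable parameters are the codebook entries, which is one plausible reading of why the abstract reports stable training and full codebook utilization at a fraction of the from-scratch budget.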

