CODA: Repurposing Continuous VAEs for Discrete Tokenization

Abstract

Discrete visual tokenizers transform images into sequences of tokens, enabling token-based visual generation akin to language models. This process is inherently challenging because it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, which often leads to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression from discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs, which are already optimized for perceptual compression, into discrete tokenizers via a carefully designed discretization process. By focusing primarily on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with a training budget 6 times smaller than that of a standard VQGAN, our approach achieves 100% codebook utilization and reconstruction FID (rFID) scores of 0.43 and 1.34 for 8× and 16× compression, respectively, on the ImageNet 256×256 benchmark.
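To make the two quantities in the abstract concrete, the following is a minimal sketch of discretizing continuous latents against a fixed codebook and measuring codebook utilization. It is not CODA's discretization process, which the abstract does not specify. It assumes plain nearest-neighbor vector quantization under squared L2 distance, uses random vectors as a stand-in for the outputs of a frozen continuous VAE encoder, and all function names (`nearest_code`, `quantize`, `codebook_utilization`) are hypothetical.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents, codebook):
    """Map each continuous latent vector to a discrete token (a codebook index)."""
    return [nearest_code(z, codebook) for z in latents]


def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


# Assumption: random Gaussian vectors stand in for frozen VAE encoder outputs.
random.seed(0)
dim, codebook_size = 4, 16
latents = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(1024)]
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]

tokens = quantize(latents, codebook)
print(f"codebook utilization: {codebook_utilization(tokens, codebook_size):.2%}")
```

In this framing, low codebook utilization means that many entries are never selected by any latent, so part of the token vocabulary is wasted; the 100% figure reported above means every code is used.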
