

CODA: Repurposing Continuous VAEs for Discrete Tokenization

March 22, 2025
Authors: Zeyu Liu, Zanlin Ni, Yeguo Hua, Xin Deng, Xiao Ma, Cheng Zhong, Gao Huang
cs.AI

Abstract

Discrete visual tokenizers transform images into a sequence of tokens, enabling token-based visual generation akin to language models. However, this process is inherently challenging, as it requires both compressing visual signals into a compact representation and discretizing them into a fixed set of codes. Traditional discrete tokenizers typically learn the two tasks jointly, often leading to unstable training, low codebook utilization, and limited reconstruction quality. In this paper, we introduce CODA (COntinuous-to-Discrete Adaptation), a framework that decouples compression and discretization. Instead of training discrete tokenizers from scratch, CODA adapts off-the-shelf continuous VAEs -- already optimized for perceptual compression -- into discrete tokenizers via a carefully designed discretization process. By primarily focusing on discretization, CODA ensures stable and efficient training while retaining the strong visual fidelity of continuous VAEs. Empirically, with 6× less training budget than standard VQGAN, our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FIDs (rFID) of 0.43 and 1.34 for 8× and 16× compression on the ImageNet 256×256 benchmark.
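
To make the decoupling concrete, the sketch below shows the general shape of adapting a frozen continuous VAE into a discrete tokenizer: latents from a pretrained encoder are snapped to their nearest entries in a learned codebook, so only the discretization stage is trained. This is a minimal illustration of the idea stated in the abstract; the module names, codebook size, and loss weighting are assumptions, not CODA's actual discretization design.

```python
# Minimal sketch: quantize latents from a frozen, pretrained continuous VAE.
# All names, sizes, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentQuantizer(nn.Module):
    """Nearest-neighbour quantization of continuous VAE latents."""

    def __init__(self, num_codes: int = 16384, dim: int = 16, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, C, H, W) latents from a frozen continuous VAE encoder.
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (B*H*W, C)
        # Assign each latent vector to its nearest codebook entry.
        dists = torch.cdist(flat, self.codebook.weight)  # (B*H*W, K)
        idx = dists.argmin(dim=1)                        # discrete token ids
        zq = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # VQ-style losses: pull codes toward the (fixed) latents, plus a
        # commitment term kept for generality even though the encoder is frozen.
        loss = F.mse_loss(zq, z.detach()) + self.beta * F.mse_loss(z, zq.detach())
        # Straight-through estimator: the decoder sees quantized latents,
        # but gradients pass through as if they were continuous.
        zq = z + (zq - z).detach()
        return zq, idx.view(b, h, w), loss
```

Because the VAE is already optimized for perceptual compression, only the quantizer (and, in a variant of this sketch, a lightweight decoder adaptation) needs training, which is where the stability and efficiency gains described in the abstract would come from.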

