

UniTok: A Unified Tokenizer for Visual Generation and Understanding

February 27, 2025
作者: Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi
cs.AI

Abstract

The representation disparity between visual generation and understanding imposes a critical gap in integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which partitions vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overly large codebooks. Our method significantly raises the upper limit of unified discrete tokenizers, allowing them to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.
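The multi-codebook idea in the abstract can be sketched in a few lines: split each latent vector into chunks, quantize each chunk against its own small sub-codebook, and concatenate the results, so the effective vocabulary grows multiplicatively (Kⁿ for n sub-codebooks of size K) without any single codebook becoming unwieldy. The following is a minimal NumPy illustration under assumed shapes and names, not the authors' implementation:

```python
import numpy as np

def multi_codebook_quantize(z, codebooks):
    """Quantize a latent vector with several independent sub-codebooks.

    z: (d,) feature vector, split evenly into one chunk per sub-codebook.
    codebooks: list of (K, d // n) arrays, one per chunk.
    Returns the concatenated quantized vector and the per-chunk code indices.
    """
    chunks = np.split(z, len(codebooks))
    quantized, indices = [], []
    for chunk, book in zip(chunks, codebooks):
        # Nearest-neighbor lookup within this sub-codebook only.
        dists = np.linalg.norm(book - chunk, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized.append(book[idx])
    return np.concatenate(quantized), indices

# Toy example: an 8-dim latent split across 4 sub-codebooks of 16 codes each,
# giving an effective vocabulary of 16**4 combinations.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((16, 2)) for _ in range(4)]
z = rng.standard_normal(8)
z_q, codes = multi_codebook_quantize(z, codebooks)
```

Each sub-codebook search is over only 16 entries here, which is the stability argument: the per-lookup codebook stays small while the combined latent space stays large.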
