UniTok: A Unified Tokenizer for Visual Generation and Understanding

Abstract

The representation disparity between visual generation and understanding creates a critical gap when integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts during training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overly large codebooks. Our method significantly raises the upper limit of unified discrete tokenizers, allowing them to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves an rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.
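
To make the idea of multi-codebook quantization concrete, here is a minimal pure-Python sketch. It is not taken from the UniTok repository. It assumes that the latent vector is split into equal chunks and that each chunk is matched to its nearest entry (squared L2 distance) in its own sub-codebook. The function names, the random codebooks, and the toy sizes (4 sub-codebooks, 8 entries each, latent dimension 16) are all hypothetical and chosen only for illustration.

```python
import random


def nearest_code(chunk, codebook):
    """Return (index, entry) of the codebook entry closest to `chunk` in squared L2 distance."""
    best_idx = min(
        range(len(codebook)),
        key=lambda i: sum((c - e) ** 2 for c, e in zip(chunk, codebook[i])),
    )
    return best_idx, codebook[best_idx]


def multi_codebook_quantize(latent, codebooks):
    """Split `latent` into len(codebooks) equal chunks and quantize each chunk independently."""
    num_books = len(codebooks)
    if len(latent) % num_books != 0:
        raise ValueError("latent dimension must be divisible by the number of sub-codebooks")
    chunk_dim = len(latent) // num_books
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        chunk = latent[k * chunk_dim:(k + 1) * chunk_dim]
        idx, entry = nearest_code(chunk, codebook)
        indices.append(idx)        # one discrete index per sub-codebook
        quantized.extend(entry)    # concatenate the selected entries
    return indices, quantized


# Hypothetical toy sizes, for illustration only.
NUM_CODEBOOKS, CODES_PER_BOOK, LATENT_DIM = 4, 8, 16
CHUNK_DIM = LATENT_DIM // NUM_CODEBOOKS

rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(CHUNK_DIM)] for _ in range(CODES_PER_BOOK)]
    for _ in range(NUM_CODEBOOKS)
]
latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]

indices, quantized = multi_codebook_quantize(latent, codebooks)

# Number of distinct index combinations a single token can take.
effective_vocab = CODES_PER_BOOK ** NUM_CODEBOOKS
print(indices, effective_vocab)
```

In this toy setup each nearest-neighbor search runs over only 8 entries, yet one token can take 8^4 = 4096 distinct index combinations. This is one way to read the abstract's claim that the latent space can be expanded without relying on a single very large codebook.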
