UniTok: A Unified Tokenizer for Visual Generation and Understanding

Abstract

The representation disparity between visual generation and understanding makes it difficult to integrate both capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts in training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which splits vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overly large codebooks. Our method significantly raises the upper limit of unified discrete tokenizers, allowing them to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves an rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.
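The idea of quantizing with several independent sub-codebooks can be illustrated with a minimal sketch. It is not the UniTok implementation: it assumes the latent vector is split into equal contiguous chunks, that each chunk is matched to its nearest entry (squared L2 distance) in its own sub-codebook, and that codebooks are random rather than learned. The function names (`nearest_index`, `multi_codebook_quantize`, `make_codebooks`) and the toy sizes are hypothetical and chosen only for illustration.

```python
import random


def nearest_index(chunk, codebook):
    """Return the index of the codebook entry closest to chunk (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(chunk, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multi_codebook_quantize(vector, codebooks):
    """Quantize one latent vector with several independent sub-codebooks.

    Assumption: the vector is split into equal contiguous chunks, one per
    sub-codebook, and each chunk is looked up in its own sub-codebook only.
    """
    num_books = len(codebooks)
    if len(vector) % num_books != 0:
        raise ValueError("vector length must be divisible by the number of sub-codebooks")
    chunk_dim = len(vector) // num_books
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        chunk = vector[k * chunk_dim:(k + 1) * chunk_dim]
        index = nearest_index(chunk, codebook)
        indices.append(index)
        quantized.extend(codebook[index])
    return indices, quantized


def make_codebooks(num_books, book_size, chunk_dim, seed=0):
    """Build random stand-in codebooks; a real tokenizer would learn them."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(chunk_dim)] for _ in range(book_size)]
        for _ in range(num_books)
    ]


# Toy sizes for illustration only; they are not taken from the paper.
NUM_BOOKS, BOOK_SIZE, CHUNK_DIM = 4, 16, 2
codebooks = make_codebooks(NUM_BOOKS, BOOK_SIZE, CHUNK_DIM)
latent = [0.1 * i for i in range(NUM_BOOKS * CHUNK_DIM)]
indices, quantized = multi_codebook_quantize(latent, codebooks)

# Each token is a tuple of sub-codebook indices, so the number of distinct
# tokens grows as BOOK_SIZE ** NUM_BOOKS while each lookup stays small.
combinations = BOOK_SIZE ** NUM_BOOKS
print(indices, combinations)
```

In this toy setting, four sub-codebooks of 16 entries each can express 16^4 = 65,536 distinct tokens, while every nearest-neighbor search runs over only 16 entries. This is the sense in which splitting the quantizer expands the latent space without requiring a single very large codebook.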
