Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Abstract

Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, which makes them difficult to reproduce. In this work, we introduce the Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can use either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), which accelerates convergence and improves performance. TA-TiTok also benefits from a simplified yet effective one-stage training process, eliminating the complex two-stage distillation used in previous 1-dimensional tokenizers. This design scales seamlessly to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving performance comparable to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.

