Factorized Visual Tokenization and Generation

Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. https://showlab.github.io/FQGAN
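
To make the lookup-complexity claim concrete, the following is a minimal sketch of quantizing one feature vector with several independent sub-codebooks. It is not the FQGAN implementation: the abstract does not say how features are routed to sub-codebooks, so splitting the vector into equal chunks is an assumption made here for illustration, and the names `nearest`, `factorized_quantize`, `K`, `SUB_SIZE`, and `DIM` are hypothetical. The disentanglement regularization and the CLIP/DINO representation learning are left out.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def factorized_quantize(feature, sub_codebooks):
    """Quantize one feature vector with k independent sub-codebooks.

    Assumption (not from the paper): the feature is split into k equal
    chunks, and chunk i is matched only against sub-codebook i.
    """
    k = len(sub_codebooks)
    chunk = len(feature) // k
    indices, quantized = [], []
    for i, codebook in enumerate(sub_codebooks):
        part = feature[i * chunk:(i + 1) * chunk]
        idx = nearest(codebook, part)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Hypothetical toy sizes: 2 sub-codebooks of 4 entries each, 6-dim features.
K, SUB_SIZE, DIM = 2, 4, 6
rng = random.Random(0)
sub_codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM // K)] for _ in range(SUB_SIZE)]
    for _ in range(K)
]
feature = [rng.gauss(0, 1) for _ in range(DIM)]
indices, quantized = factorized_quantize(feature, sub_codebooks)

# One token is a tuple of K indices, so the number of distinct tokens grows
# as a product while the number of distance computations grows as a sum.
effective_vocab = SUB_SIZE ** K   # distinct index combinations
lookups = SUB_SIZE * K            # distance computations per feature
```

In this toy setup, two sub-codebooks of 4 entries give 16 distinct index combinations at the cost of 8 distance computations, whereas a single flat codebook covering the same 16 combinations would need 16.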
