GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Abstract

In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality, a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
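
To make the idea of semantic regularization concrete, the following is a minimal sketch of one way an alignment term could be written. The abstract does not specify the loss form, so the cosine-similarity objective, the function names, and the `weight` parameter are all illustrative assumptions rather than the authors' implementation. The sketch only shows the general shape: tokenizer features are pulled toward features from a frozen pre-trained encoder, and that term is added to the reconstruction loss.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no alignment signal
    return dot / (norm_a * norm_b)


def semantic_regularization_loss(
    tokenizer_feats: Sequence[Vector],
    encoder_feats: Sequence[Vector],
) -> float:
    """Hypothetical alignment term: 1 - mean cosine similarity.

    tokenizer_feats: per-token features from the tokenizer (assumed already
        projected to the encoder's feature dimension).
    encoder_feats: features from a frozen pre-trained visual encoder.
    """
    if not tokenizer_feats or len(tokenizer_feats) != len(encoder_feats):
        raise ValueError("feature lists must be non-empty and equal in length")
    sims = [cosine_similarity(t, e) for t, e in zip(tokenizer_feats, encoder_feats)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(
    reconstruction_loss: float,
    tokenizer_feats: Sequence[Vector],
    encoder_feats: Sequence[Vector],
    weight: float = 0.5,  # assumed weighting; not taken from the paper
) -> float:
    """Reconstruction loss plus the weighted (assumed) semantic term."""
    reg = semantic_regularization_loss(tokenizer_feats, encoder_feats)
    return reconstruction_loss + weight * reg
```

In this sketch, perfectly aligned features contribute no penalty, while features orthogonal to the encoder's contribute the full weighted term. A real training setup would compute this on batched tensors with gradients flowing only into the tokenizer.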
