Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Abstract

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. Token representation poses a fundamental dilemma. Discrete tokens enable straightforward modeling with a standard cross-entropy loss, but suffer from information loss and tokenizer training instability. Continuous tokens better preserve visual details, but require complex distribution modeling, which complicates the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that obtains discrete tokens directly from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
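
The dimension-wise post-training quantization described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the binning scheme, so the uniform bins, the clipping range `[-3, 3]`, the 16 levels per dimension, and the function names `quantize_dimwise` and `dequantize_dimwise` are all assumptions made for illustration.

```python
import numpy as np

def quantize_dimwise(z, num_bins=16, lo=-3.0, hi=3.0):
    """Map every feature dimension of z independently to an integer bin index.

    Assumption: uniform bins over a fixed clipped range [lo, hi].
    """
    z = np.clip(np.asarray(z, dtype=np.float64), lo, hi)
    width = (hi - lo) / num_bins
    idx = np.floor((z - lo) / width).astype(np.int64)
    return np.minimum(idx, num_bins - 1)  # a value equal to hi goes in the last bin

def dequantize_dimwise(idx, num_bins=16, lo=-3.0, hi=3.0):
    """Approximate the continuous value by the centre of its bin."""
    width = (hi - lo) / num_bins
    return lo + (np.asarray(idx) + 0.5) * width

# Stand-in for the output of a pretrained continuous tokenizer (hypothetical data):
# 4 tokens with 8 channels each.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))

codes = quantize_dimwise(tokens)    # integer indices, same shape as tokens
recon = dequantize_dimwise(codes)   # approximate continuous tokens
max_err = np.abs(np.clip(tokens, -3.0, 3.0) - recon).max()
```

Under these assumptions, each token becomes 8 indices with 16 possible values each, so the joint space per token has 16^8 combinations. Predicting the indices one dimension at a time with ordinary categorical outputs, rather than with a single softmax over the joint space, is one way to read the role of the lightweight autoregressive prediction mechanism mentioned in the abstract.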
