Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Abstract

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. A fundamental dilemma exists in token representation: discrete tokens enable straightforward modeling with standard cross-entropy loss, but suffer from information loss and tokenizer training instability; continuous tokens better preserve visual details, but require complex distribution modeling, which complicates the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from the tokenizer training process through post-training quantization that obtains discrete tokens directly from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging discrete and continuous paradigms can effectively harness the strengths of both approaches, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
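To make the idea of dimension-wise post-training quantization concrete, the following is a minimal sketch, not the TokenBridge implementation. It assumes uniform bins over a clipped range; the abstract does not specify the bin layout, and the names `quantize`, `dequantize`, `num_bins`, `lo`, and `hi` are hypothetical choices made for illustration. Each feature dimension of a continuous token is mapped to its own bin index, and the indices can be mapped back to bin centers with bounded error.

```python
# Illustrative sketch only: uniform bins and the [lo, hi] range are assumptions,
# not details taken from the paper.

def quantize(latent, num_bins=16, lo=-3.0, hi=3.0):
    """Map each feature value to a bin index, independently per dimension."""
    width = (hi - lo) / num_bins
    indices = []
    for value in latent:
        clipped = min(max(value, lo), hi)           # keep the value inside [lo, hi]
        index = int((clipped - lo) / width)         # bin that contains the value
        indices.append(min(index, num_bins - 1))    # value == hi falls in the last bin
    return indices


def dequantize(indices, num_bins=16, lo=-3.0, hi=3.0):
    """Map each bin index back to the center of its bin."""
    width = (hi - lo) / num_bins
    return [lo + (index + 0.5) * width for index in indices]


# One continuous token with five feature dimensions (made-up values).
latent = [-2.7, -0.4, 0.0, 1.3, 2.9]
indices = quantize(latent)
recon = dequantize(indices)

# For in-range values, the reconstruction error is at most half a bin width.
max_err = max(abs(a - b) for a, b in zip(latent, recon))

# Number of distinct joint tokens if all dimensions were predicted at once.
joint_space = 16 ** len(latent)

print(indices, recon, max_err, joint_space)
```

Because every dimension is discretized on its own, the number of joint combinations grows exponentially with the number of dimensions (here `16 ** 5`), which is why a single categorical prediction over the whole token is impractical and a per-dimension autoregressive prediction is a natural fit.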

