Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

Abstract

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially. Token representation poses a fundamental dilemma: discrete tokens can be modeled directly with a standard cross-entropy loss, but they suffer from information loss and unstable tokenizer training; continuous tokens preserve visual details better, but they require complex distribution modeling, which complicates the generation pipeline. In this paper, we propose TokenBridge, which bridges this gap by maintaining the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. To achieve this, we decouple discretization from tokenizer training through post-training quantization that obtains discrete tokens directly from continuous representations. Specifically, we introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism that efficiently models the resulting large token space. Extensive experiments show that our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction. This work demonstrates that bridging the discrete and continuous paradigms can effectively harness the strengths of both, providing a promising direction for high-quality visual generation with simple autoregressive modeling. Project page: https://yuqingwang1029.github.io/TokenBridge.
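
The idea of dimension-wise post-training quantization can be illustrated with a minimal sketch. The abstract does not say how bins are placed, so everything specific below is an assumption made for illustration: uniform bins over a clipped range, the hypothetical function names, and the values of `num_bins` and `clip`. It is not the TokenBridge implementation.

```python
import numpy as np

def quantize_per_dimension(latents, num_bins=16, clip=3.0):
    """Map every feature dimension to an integer bin index, independently.

    Assumption: uniform bins over [-clip, clip]; the paper may place bins differently.
    """
    clipped = np.clip(latents, -clip, clip)
    # Rescale [-clip, clip] to [0, num_bins] and truncate to a bin index.
    scaled = (clipped + clip) / (2.0 * clip) * num_bins
    # The value exactly at +clip would land in bin num_bins, so cap it.
    return np.minimum(scaled.astype(np.int64), num_bins - 1)

def dequantize_per_dimension(indices, num_bins=16, clip=3.0):
    """Map bin indices back to the centre of their bin."""
    bin_width = 2.0 * clip / num_bins
    return -clip + (indices + 0.5) * bin_width

# Stand-in for continuous tokenizer outputs: 4 tokens with 8 channels each.
rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 8)).astype(np.float32)

indices = quantize_per_dimension(latents)   # discrete tokens, one index per channel
recon = dequantize_per_dimension(indices)   # approximate continuous values
```

Because no tokenizer retraining is involved, the quantizer is a fixed function applied after the fact. Under these assumed settings, each token can take one of 16^8 (about 4.3 billion) joint values, which shows why a single categorical prediction over the whole token is impractical and why the abstract pairs the quantizer with a prediction mechanism designed for a large token space.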
