NativeTok: Native Visual Tokenization for Improved Image Generation

January 30, 2026
Authors: Bin Wu, Mengqi Huang, Weinan Jia, Zhendong Mao
cs.AI

Abstract

VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
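
The abstract names the components without implementation detail; the following is a minimal PyTorch-style sketch, under stated assumptions, of the two mechanisms it does spell out: each lightweight causal expert emits a single token conditioned on prior tokens and latent features, and training updates only newly added expert blocks. All class names, shapes, and hyperparameters here are hypothetical illustrations, not the paper's actual implementation.

import torch
import torch.nn as nn

class CausalExpert(nn.Module):
    # Hypothetical lightweight expert: scores the codebook for one token
    # position by attending from latent image features to the tokens
    # emitted so far. Names and shapes are assumptions, not the paper's.
    def __init__(self, dim, codebook_size):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.to_logits = nn.Linear(dim, codebook_size)

    def forward(self, prior_tokens, latent):
        # latent: (B, L, dim) image features; prior_tokens: (B, T, dim)
        ctx, _ = self.attn(latent, prior_tokens, prior_tokens)
        return self.to_logits(ctx.mean(dim=1))  # (B, codebook_size)

class MoCETSketch(nn.Module):
    # One expert per token position, so token i depends only on tokens < i:
    # the causal ordering is enforced by construction during tokenization.
    def __init__(self, num_tokens, dim, codebook_size):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.bos = nn.Parameter(torch.zeros(1, 1, dim))  # start-of-sequence state
        self.experts = nn.ModuleList(
            CausalExpert(dim, codebook_size) for _ in range(num_tokens)
        )

    @torch.no_grad()
    def tokenize(self, latent):
        prior = self.bos.expand(latent.size(0), -1, -1)
        tokens = []
        for expert in self.experts:
            tok = expert(prior, latent).argmax(dim=-1)  # greedy pick, for the sketch
            tokens.append(tok)
            prior = torch.cat([prior, self.embed(tok)[:, None]], dim=1)
        return torch.stack(tokens, dim=1)  # (B, num_tokens) discrete codes

def train_only_new_experts(model, num_new):
    # One plausible reading of the Hierarchical Native Training strategy:
    # freeze all parameters, then re-enable gradients for the num_new
    # most recently appended expert blocks.
    for p in model.parameters():
        p.requires_grad_(False)
    for expert in model.experts[-num_new:]:
        for p in expert.parameters():
            p.requires_grad_(True)

Read this way, the tokenizer itself fixes the left-to-right dependency structure, so the second-stage generative model never has to infer an ordering from unordered token statistics.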