NativeTok: Native Visual Tokenization for Improved Image Generation

January 30, 2026
Authors: Bin Wu, Mengqi Huang, Weinan Jia, Zhendong Mao
cs.AI

Abstract

VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, better tokenization in the first stage does not necessarily improve second-stage generation, because existing methods fail to constrain the dependencies among tokens. This mismatch forces the generative model to learn from an unordered distribution, leading to biased and weakly coherent results. To address this, we propose native visual tokenization, which enforces causal dependencies during the tokenization stage itself. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within the token sequence. NativeTok consists of two core components: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), in which each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only the newly added expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.
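The abstract only sketches the architecture. As an illustration, the snippet below gives a minimal PyTorch sketch of the causal-expert idea it describes: each lightweight expert block emits one token conditioned on the latent features and all previously produced tokens, and the hierarchical training step freezes existing experts so that only newly added blocks are updated. All names, shapes, and hyperparameters here (CausalExpert, MoCETSketch, grow, dim, n_heads) are illustrative assumptions, not the authors' implementation; vector quantization and the Meta Image Transformer encoder are elided.

```python
# Minimal sketch of the causal-expert idea from the abstract. Everything
# here (names, shapes, hyperparameters) is an assumption for illustration,
# not the authors' implementation; vector quantization is omitted.
import torch
import torch.nn as nn


class CausalExpert(nn.Module):
    """Lightweight block that predicts a single token embedding."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        # A learned query cross-attends to prior tokens + latent features.
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (B, n_prior_tokens + n_latent, dim)
        q = self.query.expand(context.size(0), -1, -1)
        out, _ = self.attn(q, context, context)
        return self.ffn(out)  # (B, 1, dim): embedding of the next token


class MoCETSketch(nn.Module):
    """Mixture of causal experts: token i comes from expert i, conditioned
    on the latent features and tokens 0..i-1 (the causal dependency)."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.experts = nn.ModuleList(
            CausalExpert(dim) for _ in range(num_tokens)
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # latent: (B, n_latent, dim) features from an upstream encoder
        # (standing in for the Meta Image Transformer).
        tokens: list[torch.Tensor] = []
        for expert in self.experts:
            context = torch.cat([latent, *tokens], dim=1)
            tokens.append(expert(context))
        return torch.cat(tokens, dim=1)  # (B, num_tokens, dim)

    def grow(self, n_new: int, dim: int) -> None:
        # Hierarchical Native Training, as described in the abstract:
        # freeze every existing parameter, then append fresh experts so
        # that only the new blocks receive gradient updates.
        for p in self.parameters():
            p.requires_grad_(False)
        self.experts.extend(CausalExpert(dim) for _ in range(n_new))
```

A quick shape check under these assumptions:

```python
model = MoCETSketch(num_tokens=4, dim=64)
latent = torch.randn(2, 16, 64)  # batch of 2, 16 latent feature vectors
print(model(latent).shape)       # torch.Size([2, 4, 64])
model.grow(n_new=4, dim=64)      # only the 4 new experts remain trainable
```

The per-token loop makes the causal constraint explicit: token i can attend only to the latent features and tokens 0..i-1, which is the relational structure the abstract says is embedded in the token sequence.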