
Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

October 8, 2025
Authors: Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
cs.AI

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers with discrete latent spaces to align with the tokens of large language models, where quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of MingTok, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation place on the tokenizer, leading to state-of-the-art performance across both domains. We hope these findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
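
The abstract only outlines how the three-stage tokenizer and the shared-space next-token formulation fit together. The PyTorch-style snippet below is a minimal illustrative sketch of those two ideas; the class names, dimensions, and the plain linear prediction head are assumptions made for exposition, not the released Ming-UniVision implementation.

```python
# Minimal sketch, assuming PyTorch; names, dimensions, and heads are
# illustrative assumptions, not the released MingTok / Ming-UniVision code.
import torch
import torch.nn as nn


class ContinuousVisualTokenizerSketch(nn.Module):
    """Three-stage idea from the abstract: low-level encoding ->
    semantic expansion -> visual reconstruction, all in a continuous
    latent space (no codebook, hence no quantization error)."""

    def __init__(self, patch_dim=768, low_dim=64, semantic_dim=1024):
        super().__init__()
        # Stage 1: compress image patches into compact, low-level continuous
        # codes -- the representation generation prefers.
        self.low_level_encoder = nn.Linear(patch_dim, low_dim)
        # Stage 2: expand the compact codes into high-dimensional semantic
        # features -- the representation understanding prefers.
        self.semantic_expander = nn.Sequential(
            nn.Linear(low_dim, semantic_dim),
            nn.GELU(),
            nn.Linear(semantic_dim, semantic_dim),
        )
        # Stage 3: reconstruct pixel-space patches from the semantic features.
        self.visual_reconstructor = nn.Linear(semantic_dim, patch_dim)

    def forward(self, patches):  # patches: (batch, num_tokens, patch_dim)
        low = self.low_level_encoder(patches)        # compact codes (generation)
        semantic = self.semantic_expander(low)       # rich features (understanding)
        recon = self.visual_reconstructor(semantic)  # reconstruction target
        return low, semantic, recon


class ContinuousNextTokenHead(nn.Module):
    """Next-token prediction in a continuous space: instead of a softmax over
    a discrete codebook, the model regresses the next continuous visual code.
    (The abstract does not specify the actual prediction head; a linear
    regression head is used here purely for illustration.)"""

    def __init__(self, hidden_dim=1024, low_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, low_dim)

    def forward(self, hidden_state):  # hidden_state: (batch, hidden_dim)
        return self.proj(hidden_state)  # predicted next continuous code


if __name__ == "__main__":
    tokenizer = ContinuousVisualTokenizerSketch()
    patches = torch.randn(2, 16, 768)              # 2 images, 16 patches each
    low, semantic, recon = tokenizer(patches)
    print(low.shape, semantic.shape, recon.shape)  # (2,16,64) (2,16,1024) (2,16,768)
```

Under this framing, understanding operates on the expanded semantic features while generation autoregressively predicts the next compact code, so both tasks consume outputs of the same continuous tokenizer rather than task-specific representations.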