Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
October 8, 2025
Authors: Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou
cs.AI
Abstract
Visual tokenization remains a core challenge in unifying visual understanding
and generation within the autoregressive paradigm. Existing methods typically
employ tokenizers in discrete latent spaces to align with the tokens from large
language models, where the quantization errors can limit semantic
expressiveness and degrade the capability of vision-language understanding. To
address this, we introduce MingTok, a new family of visual tokenizers with a
continuous latent space, for unified autoregressive generation and
understanding. While understanding tasks favor discriminative high-dimensional
features, generation tasks prefer compact low-level codes. Thus, to reconcile
these competing demands, MingTok adopts a three-stage sequential architecture
involving low-level encoding, semantic expansion, and visual reconstruction.
Built on top of MingTok, Ming-UniVision eliminates the need for task-specific
visual representations and unifies diverse vision-language tasks under a single
autoregressive prediction paradigm. By formulating both understanding and
generation as next-token prediction in a shared continuous space, it seamlessly
supports multi-round, in-context tasks such as iterative understanding,
generation, and editing. Empirically, we find that a unified continuous visual
representation reconciles the competing requirements that understanding and
generation tasks place on the tokenizer, leading to state-of-the-art-level
performance across both domains. We hope our findings will facilitate
unified visual tokenization in the continuous domain. Inference code and model
weights are released to benefit the community.
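
To make the abstract's two ideas concrete, below is a minimal, hypothetical sketch of the three-stage tokenizer flow (low-level encoding, semantic expansion, visual reconstruction) and of casting generation as next-token prediction over the shared continuous latents. The class name MingTokSketch, all layer choices and dimensions, and the toy regression head are assumptions made purely for illustration; they are not the released Ming-UniVision implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MingTokSketch(nn.Module):
    """Hypothetical stand-in for the three-stage tokenizer described in the abstract:
    low-level encoding -> semantic expansion -> visual reconstruction."""

    def __init__(self, patch_dim=768, low_dim=32, sem_dim=1024):
        super().__init__()
        # Stage 1: compress image patches into compact, low-level continuous codes
        # (the representation the abstract says generation prefers).
        self.low_level_encoder = nn.Linear(patch_dim, low_dim)
        # Stage 2: expand the compact codes into discriminative, high-dimensional
        # semantic features (the representation understanding prefers).
        self.semantic_expander = nn.Sequential(
            nn.Linear(low_dim, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, sem_dim),
        )
        # Stage 3: reconstruct pixel-level patches from the semantic features.
        self.visual_decoder = nn.Linear(sem_dim, patch_dim)

    def forward(self, patches):
        low = self.low_level_encoder(patches)   # compact continuous latents
        sem = self.semantic_expander(low)       # high-dimensional semantics
        recon = self.visual_decoder(sem)        # reconstructed patches
        return low, sem, recon


# Toy usage: because understanding and generation share the same continuous token
# space, an autoregressive model can treat image generation as regressing the next
# continuous token. A single linear head stands in for the LLM backbone here.
tokenizer = MingTokSketch()
patches = torch.randn(1, 196, 768)              # e.g. 14x14 patches of one image
low, sem, recon = tokenizer(patches)

next_token_head = nn.Linear(1024, 32)           # predicts the next compact token
pred_next = next_token_head(sem[:, :-1])        # token t's semantics -> token t+1
regression_loss = F.mse_loss(pred_next, low[:, 1:].detach())
print(recon.shape, regression_loss.item())
```

In this sketch the same continuous latents serve both roles: the high-dimensional semantic features feed understanding-style prediction, while the compact low-level codes are the targets of autoregressive generation, which is one plausible reading of how a single continuous tokenizer can satisfy both tasks.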