Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers with discrete latent spaces to align with the tokens of large language models, but the resulting quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation tasks place on the tokenizer, leading to state-of-the-art-level performance in both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
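The three-stage sequential design named in the abstract can be pictured at the level of tensor shapes. The sketch below is only an illustration under stated assumptions: the class name ToyThreeStageTokenizer, the dimensions, and the use of random linear maps are all hypothetical and are not taken from MingTok. It also assumes that each stage feeds the next, as "sequential" suggests. It shows a compact continuous latent per patch (no quantization step), a higher-dimensional semantic feature derived from that latent, and a reconstruction back to pixel space.

```python
import random

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
PATCH_DIM = 48      # one flattened 4x4 RGB patch
LATENT_DIM = 8      # compact continuous code, the kind generation prefers
SEMANTIC_DIM = 64   # high-dimensional feature, the kind understanding prefers


def random_linear(in_dim, out_dim, rng):
    """Build a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-1.0, 1.0) / in_dim for _ in range(in_dim)]
            for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class ToyThreeStageTokenizer:
    """Stand-in for a three-stage tokenizer; every stage is a random linear map."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.encoder = random_linear(PATCH_DIM, LATENT_DIM, rng)
        self.expander = random_linear(LATENT_DIM, SEMANTIC_DIM, rng)
        self.decoder = random_linear(SEMANTIC_DIM, PATCH_DIM, rng)

    def encode(self, patches):
        """Stage 1: map each patch to a compact continuous latent (no codebook lookup)."""
        return [apply_linear(self.encoder, p) for p in patches]

    def expand(self, latents):
        """Stage 2: lift each latent to a high-dimensional semantic feature."""
        return [apply_linear(self.expander, z) for z in latents]

    def reconstruct(self, features):
        """Stage 3: map each semantic feature back to pixel space."""
        return [apply_linear(self.decoder, f) for f in features]


# Sixteen random patches stand in for one small image.
rng = random.Random(1)
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(16)]

tokenizer = ToyThreeStageTokenizer()
latents = tokenizer.encode(patches)
features = tokenizer.expand(latents)
pixels = tokenizer.reconstruct(features)

print(len(latents), len(latents[0]))    # 16 8
print(len(features), len(features[0]))  # 16 64
print(len(pixels), len(pixels[0]))      # 16 48
```

In this toy setup, an autoregressive model would predict the 8-dimensional latents token by token, while an understanding head would read the 64-dimensional features. That is one way to picture how a single continuous representation could serve both kinds of task.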
