Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers with discrete latent spaces to align with the tokens of large language models, but quantization errors can limit semantic expressiveness and degrade vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. To reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that a unified continuous visual representation reconciles the competing requirements that understanding and generation tasks place on the tokenizer, leading to state-of-the-art-level performance in both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
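
As a rough illustration of the three-stage sequential design described in the abstract, the toy sketch below chains low-level encoding, semantic expansion, and visual reconstruction. It is not the MingTok implementation: the class name `ToyThreeStageTokenizer`, the random linear maps, the dimensions (48, 8, 64), and the assumption that each stage consumes the previous stage's output are all illustrative choices that the abstract does not specify.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a fixed random linear map over plain Python float lists."""
    scale = in_dim ** -0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    return apply


class ToyThreeStageTokenizer:
    """Hypothetical stand-in for a three-stage continuous tokenizer."""

    def __init__(self, patch_dim=48, latent_dim=8, semantic_dim=64, seed=0):
        rng = random.Random(seed)
        # Stage 1: compress each patch into a compact continuous code.
        self.low_level_encoder = make_linear(patch_dim, latent_dim, rng)
        # Stage 2: expand the compact code into a higher-dimensional feature.
        self.semantic_expander = make_linear(latent_dim, semantic_dim, rng)
        # Stage 3: map the expanded feature back to patch (pixel) space.
        self.pixel_decoder = make_linear(semantic_dim, patch_dim, rng)

    def encode(self, patches):
        return [self.low_level_encoder(p) for p in patches]

    def expand(self, latents):
        return [self.semantic_expander(z) for z in latents]

    def reconstruct(self, features):
        return [self.pixel_decoder(f) for f in features]


# Assumed toy input: 16 flattened patches with 48 values each.
data_rng = random.Random(1)
patches = [[data_rng.random() for _ in range(48)] for _ in range(16)]

tokenizer = ToyThreeStageTokenizer()
latents = tokenizer.encode(patches)        # compact codes, e.g. for generation
features = tokenizer.expand(latents)       # richer features, e.g. for understanding
recon = tokenizer.reconstruct(features)    # back to patch space

print(len(latents), len(latents[0]))    # 16 8
print(len(features), len(features[0]))  # 16 64
print(len(recon), len(recon[0]))        # 16 48
```

In this sketch, the compact codes play the role of the low-level representation that generation favors, and the expanded features play the role of the high-dimensional representation that understanding favors. Only the shapes are meaningful here, since the weights are random and untrained.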

