MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Abstract

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
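As an illustration only, the data flow of the hybrid tokenizer described above (one shared vision encoder feeding a continuous adapter and a discrete adapter) can be sketched in a few lines of Python. Everything concrete here is an assumption and not taken from the abstract: the class name `HybridTokenizer`, the dimensions, the random linear maps standing in for the encoder and adapters, and the nearest-codebook lookup used as a stand-in quantizer. The abstract does not specify how discrete tokens are produced, and the autoregressive LLM and diffusion decoder are omitted entirely.

```python
import random


def linear(x, w):
    """Multiply a vector x (length d_in) by a weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def rand_matrix(rng, rows, cols):
    """Random Gaussian matrix used as a placeholder for learned weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class HybridTokenizer:
    """Toy hybrid tokenizer: one shared encoder, two lightweight adapters.

    All weights are random placeholders (assumption); a real model would
    learn them. The codebook lookup is a hypothetical stand-in quantizer.
    """

    def __init__(self, patch_dim=4, feat_dim=8, embed_dim=6, codebook_size=16, seed=0):
        rng = random.Random(seed)
        self.encoder = rand_matrix(rng, patch_dim, feat_dim)        # shared vision encoder
        self.cont_adapter = rand_matrix(rng, feat_dim, embed_dim)   # continuous adapter
        self.disc_adapter = rand_matrix(rng, feat_dim, embed_dim)   # discrete adapter
        self.codebook = rand_matrix(rng, codebook_size, embed_dim)  # hypothetical codebook

    def encode(self, patches):
        """Shared features, computed once per patch and used by both adapters."""
        return [linear(p, self.encoder) for p in patches]

    def continuous_embeddings(self, patches):
        """Understanding path: one continuous embedding per patch."""
        return [linear(f, self.cont_adapter) for f in self.encode(patches)]

    def discrete_tokens(self, patches):
        """Generation path: index of the nearest codebook entry per patch."""
        tokens = []
        for f in self.encode(patches):
            z = linear(f, self.disc_adapter)
            dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in self.codebook]
            tokens.append(dists.index(min(dists)))
        return tokens


# Three made-up 4-dimensional "patches" standing in for an image.
tokenizer = HybridTokenizer()
image_patches = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.9], [1.0, 0.0, 0.2, 0.3]]
embeddings = tokenizer.continuous_embeddings(image_patches)
token_ids = tokenizer.discrete_tokens(image_patches)
print(len(embeddings), len(embeddings[0]))
print(token_ids)
```

Both outputs derive from the same `encode` features, which is the sense in which the two representations share a common semantic space in this sketch. In the framework described above, the continuous embeddings would feed the LLM for image-to-text understanding, while the discrete token ids would serve as the LLM's prediction targets for text-to-image generation.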
