
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

September 19, 2025
Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
cs.AI

Abstract

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
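Below is a minimal PyTorch sketch of the hybrid-tokenizer idea described in the abstract: a single shared vision encoder whose patch features are routed through a continuous adapter (embeddings for image-to-text understanding) and a discrete adapter (codebook token ids for text-to-image generation). The class names, dimensions, and the nearest-codebook quantization step are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class HybridTokenizer(nn.Module):
    """Illustrative sketch: one shared vision encoder feeding two lightweight
    adapters that live in a common semantic space — continuous embeddings for
    understanding, discrete codebook tokens for generation."""
    def __init__(self, dim=768, codebook_size=16384):
        super().__init__()
        # Stand-in for the shared vision encoder (e.g. a ViT backbone).
        self.encoder = nn.Sequential(nn.Linear(3 * 16 * 16, dim), nn.GELU())
        # Continuous adapter: projects patch features into the LLM embedding space.
        self.cont_adapter = nn.Linear(dim, dim)
        # Discrete adapter: projects features, then quantizes against a learned codebook.
        self.disc_adapter = nn.Linear(dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patches):
        feats = self.encoder(patches)                 # (B, N, dim)
        cont_embeds = self.cont_adapter(feats)        # continuous embeddings for understanding
        q = self.disc_adapter(feats)                  # (B, N, dim)
        # Nearest-codebook-entry lookup -> discrete image token ids for generation.
        book = self.codebook.weight.unsqueeze(0).expand(q.size(0), -1, -1)
        dists = torch.cdist(q, book)                  # (B, N, codebook_size)
        token_ids = dists.argmin(dim=-1)              # (B, N)
        return cont_embeds, token_ids

# Toy usage: 4 images, 64 flattened 16x16 RGB patches each.
tok = HybridTokenizer()
patches = torch.randn(4, 64, 3 * 16 * 16)
cont, ids = tok(patches)
print(cont.shape, ids.shape)  # torch.Size([4, 64, 768]) torch.Size([4, 64])
```

In the paper's described pipeline, the continuous embeddings would be consumed directly by the unified autoregressive LLM for understanding, while the discrete token ids are what the LLM predicts for generation and an auxiliary diffusion decoder later turns into pixels.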