MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

September 19, 2025
作者: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
cs.AI

Abstract

Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
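The abstract describes a hybrid image tokenizer in which one shared vision encoder feeds two lightweight adapters: a continuous adapter for image-to-text understanding and a discrete adapter for text-to-image generation. Below is a minimal, illustrative sketch of that wiring only. All module names, dimensions, the patch-MLP stand-in for the vision encoder, and the argmax stand-in for a proper quantizer are assumptions made for clarity, not the authors' implementation; the unified autoregressive LLM and the auxiliary diffusion decoder are omitted.

```python
# Hypothetical sketch of the hybrid tokenizer idea described in the abstract.
# Names, sizes, and the encoder/quantizer choices are assumptions, not Manzano's code.
import torch
import torch.nn as nn


class HybridVisionTokenizer(nn.Module):
    """One shared vision encoder feeding two lightweight adapters:
    continuous embeddings for understanding, discrete token ids for generation."""

    def __init__(self, d_vision: int = 1024, d_model: int = 2048, codebook_size: int = 16384):
        super().__init__()
        # Stand-in for a ViT-style backbone: an MLP over flattened 16x16 RGB patches.
        self.encoder = nn.Sequential(
            nn.Linear(3 * 16 * 16, d_vision),
            nn.GELU(),
            nn.Linear(d_vision, d_vision),
        )
        # Continuous adapter: projects features into the LLM's semantic space.
        self.continuous_adapter = nn.Linear(d_vision, d_model)
        # Discrete adapter: scores features against an image-token codebook.
        self.discrete_adapter = nn.Linear(d_vision, codebook_size)

    def forward(self, patches: torch.Tensor):
        feats = self.encoder(patches)                        # (B, N, d_vision)
        cont = self.continuous_adapter(feats)                # understanding path
        disc = self.discrete_adapter(feats).argmax(dim=-1)   # generation path (token ids);
                                                             # argmax stands in for a real quantizer
        return cont, disc


if __name__ == "__main__":
    tokenizer = HybridVisionTokenizer()
    patches = torch.randn(2, 256, 3 * 16 * 16)  # 2 images, 256 flattened 16x16 patches each
    cont_emb, image_tokens = tokenizer(patches)
    print(cont_emb.shape, image_tokens.shape)   # (2, 256, 2048) and (2, 256)
```

In the paper's framing, the continuous embeddings would be consumed directly by the LLM for image-to-text understanding, while the discrete token ids would be predicted autoregressively for text-to-image generation and then rendered to pixels by a separate diffusion decoder.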