
Chameleon: Mixed-Modal Early-Fusion Foundation Models

May 16, 2024
Authors: Chameleon Team
cs.AI

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforming Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performing non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
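The "early-fusion, token-based" framing in the abstract means that images are quantized into discrete codes and interleaved with text tokens in a single shared vocabulary, so one autoregressive transformer models the whole mixed sequence. The sketch below illustrates that idea only; the vocabulary sizes, sentinel tokens, and function names are illustrative assumptions, not the actual Chameleon tokenizer.

```python
# Toy illustration of early-fusion token interleaving.
# All sizes and token IDs are hypothetical, not Chameleon's real vocabulary.

TEXT_VOCAB_SIZE = 32_000        # assumed text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192     # assumed VQ image codebook size
BOI = TEXT_VOCAB_SIZE           # begin-of-image sentinel (outside text range)
EOI = TEXT_VOCAB_SIZE + 1      # end-of-image sentinel

def image_token(code: int) -> int:
    """Map a VQ codebook index into the shared vocabulary,
    offset past the text tokens and the two sentinels."""
    assert 0 <= code < IMAGE_CODEBOOK_SIZE
    return TEXT_VOCAB_SIZE + 2 + code

def fuse(segments):
    """Flatten interleaved ("text", ids) / ("image", codes) segments into
    one token sequence that a single transformer models autoregressively."""
    seq = []
    for kind, ids in segments:
        if kind == "text":
            seq.extend(ids)                           # text IDs pass through
        elif kind == "image":
            seq.append(BOI)                           # mark image start
            seq.extend(image_token(c) for c in ids)   # remapped image codes
            seq.append(EOI)                           # mark image end
    return seq

# A short caption followed by a (tiny) image, as one mixed-modal sequence:
mixed = fuse([("text", [5, 17, 99]), ("image", [3, 7])])
# → [5, 17, 99, 32000, 32005, 32009, 32001]
```

Because image and text tokens share one vocabulary and one sequence, the same next-token objective covers captioning, text generation, and image generation, which is what allows all of these capabilities to live in a single model.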

