Chameleon: Mixed-Modal Early-Fusion Foundation Models
May 16, 2024
Authors: Chameleon Team
cs.AI
Abstract
We present Chameleon, a family of early-fusion token-based mixed-modal models
capable of understanding and generating images and text in any arbitrary
sequence. We outline a stable training approach from inception, an alignment
recipe, and an architectural parameterization tailored for the early-fusion,
token-based, mixed-modal setting. The models are evaluated on a comprehensive
range of tasks, including visual question answering, image captioning, text
generation, image generation, and long-form mixed modal generation. Chameleon
demonstrates broad and general capabilities: state-of-the-art performance in
image captioning, stronger text-only performance than Llama-2 while remaining
competitive with models such as Mixtral 8x7B and Gemini-Pro, and non-trivial
image generation, all in a single model. It also matches
or exceeds the performance of much larger models, including Gemini Pro and
GPT-4V, according to human judgments on a new long-form mixed-modal generation
evaluation, where either the prompt or outputs contain mixed sequences of both
images and text. Chameleon marks a significant step forward in unified
modeling of full multimodal documents.
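The core idea of early fusion described above, mapping both text and images into one discrete token vocabulary so that a single autoregressive transformer models the interleaved sequence, can be illustrated with a minimal sketch. The vocabulary sizes, sentinel tokens, and helper names below are illustrative assumptions, not Chameleon's actual configuration.

```python
# Minimal sketch of early-fusion mixed-modal tokenization.
# All sizes and sentinel ids are hypothetical, not Chameleon's real setup.

TEXT_VOCAB = 65_536            # assumed BPE text vocabulary size
IMAGE_VOCAB = 8_192            # assumed VQ image codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB # begin-of-image sentinel token
EOI = BOI + 1                  # end-of-image sentinel token

def image_token(code: int) -> int:
    """Offset a VQ codebook index into the shared token vocabulary."""
    assert 0 <= code < IMAGE_VOCAB
    return TEXT_VOCAB + code

def interleave(segments):
    """Flatten (kind, payload) segments into one flat token sequence.

    kind == "text":  payload is a list of text-token ids.
    kind == "image": payload is a list of VQ codes, wrapped in
    BOI/EOI sentinels so modality boundaries stay recoverable.
    """
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(payload)
        elif kind == "image":
            seq.append(BOI)
            seq.extend(image_token(c) for c in payload)
            seq.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return seq

# A document mixing text and an image becomes one sequence that a
# single transformer can both condition on and generate.
doc = [("text", [17, 42, 99]), ("image", [5, 0, 3]), ("text", [7])]
tokens = interleave(doc)
```

Because generation happens over this same shared vocabulary, emitting image tokens between BOI and EOI is how such a model would produce images in the middle of a text response.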