ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
July 8, 2024
Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu
cs.AI
Abstract
Previous open-source large multimodal models (LMMs) have faced several
limitations: (1) they often lack native integration, requiring adapters to
align visual representations with pre-trained large language models (LLMs); (2)
many are restricted to single-modal generation; (3) while some support
multimodal generation, they rely on separate diffusion models for visual
modeling and generation. To mitigate these limitations, we present Anole, an
open, autoregressive, native large multimodal model for interleaved image-text
generation. We build Anole from Meta AI's Chameleon, adopting an innovative
fine-tuning strategy that is both data-efficient and parameter-efficient. Anole
demonstrates high-quality, coherent multimodal generation capabilities. We have
open-sourced our model, training framework, and instruction tuning data.
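The abstract credits Anole's results to a fine-tuning strategy that is both data- and parameter-efficient, without detailing it here. As a rough illustration of one way such parameter efficiency can work in a token-based model like Chameleon (a sketch, not the authors' exact method), the PyTorch snippet below freezes a pretrained autoregressive model and updates only the output-head rows that score discrete image tokens. The model class, vocabulary sizes, and the IMAGE_TOKEN_IDS range are all assumed for the example.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; Chameleon/Anole's real vocabulary interleaves text
# tokens with discrete image tokens from a VQ image tokenizer.
VOCAB_SIZE = 65536
HIDDEN_DIM = 4096
IMAGE_TOKEN_IDS = torch.arange(4, 8196)  # assumed contiguous image-token ID range

class ToyAutoregressiveLMM(nn.Module):
    """Stand-in for a pretrained autoregressive multimodal transformer."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)  # placeholder for transformer layers
        self.lm_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE, bias=False)

    def forward(self, hidden_states):
        return self.lm_head(self.backbone(hidden_states))

model = ToyAutoregressiveLMM()

# Freeze every parameter, then re-enable gradients only for the output head.
for param in model.parameters():
    param.requires_grad = False
model.lm_head.weight.requires_grad = True

# Zero out gradient rows for non-image tokens, so a standard optimizer step
# only touches the logits that score image tokens.
row_mask = torch.zeros(VOCAB_SIZE, 1)
row_mask[IMAGE_TOKEN_IDS] = 1.0
model.lm_head.weight.register_hook(lambda grad: grad * row_mask)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total}")
```

Masking the gradient rather than slicing out a sub-weight keeps the checkpoint layout identical to the base model, so the frozen text-token logits stay bit-for-bit unchanged while only the image-token logits move during fine-tuning.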