ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
July 8, 2024
Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu
cs.AI
Abstract
Previous open-source large multimodal models (LMMs) have faced several
limitations: (1) they often lack native integration, requiring adapters to
align visual representations with pre-trained large language models (LLMs); (2)
many are restricted to single-modal generation; (3) while some support
multimodal generation, they rely on separate diffusion models for visual
modeling and generation. To mitigate these limitations, we present Anole, an
open, autoregressive, native large multimodal model for interleaved image-text
generation. We build Anole from Meta AI's Chameleon, adopting an innovative
fine-tuning strategy that is both data-efficient and parameter-efficient. Anole
demonstrates high-quality, coherent multimodal generation capabilities. We have
open-sourced our model, training framework, and instruction tuning data.
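The abstract credits Anole's results to a fine-tuning strategy that is both data- and parameter-efficient, without detailing it here. As a rough illustration of one way such parameter efficiency can work in a token-based model like Chameleon (a sketch, not the authors' exact method), the PyTorch snippet below freezes a pretrained autoregressive model and updates only the output-head rows that score discrete image tokens. The model class, vocabulary sizes, and the IMAGE_TOKEN_IDS range are all assumed for the example.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; Chameleon/Anole's real vocabulary interleaves text
# tokens with discrete image tokens from a VQ image tokenizer.
VOCAB_SIZE = 65536
HIDDEN_DIM = 4096
IMAGE_TOKEN_IDS = torch.arange(4, 8196)  # assumed contiguous image-token ID range

class ToyAutoregressiveLMM(nn.Module):
    """Stand-in for a pretrained autoregressive multimodal transformer."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)  # placeholder for transformer layers
        self.lm_head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE, bias=False)

    def forward(self, hidden_states):
        return self.lm_head(self.backbone(hidden_states))

model = ToyAutoregressiveLMM()

# Freeze every parameter, then re-enable gradients only for the output head.
for param in model.parameters():
    param.requires_grad = False
model.lm_head.weight.requires_grad = True

# Zero out gradient rows for non-image tokens, so a standard optimizer step
# only touches the logits that score image tokens.
row_mask = torch.zeros(VOCAB_SIZE, 1)
row_mask[IMAGE_TOKEN_IDS] = 1.0
model.lm_head.weight.register_hook(lambda grad: grad * row_mask)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total}")
```

Masking the gradient rather than slicing out a sub-weight keeps the checkpoint layout identical to the base model, so the frozen text-token logits stay bit-for-bit unchanged while only the image-token logits move during fine-tuning.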