
ANOLE: An Open, Autoregressive, Native Large Multimodal Model for Interleaved Image-Text Generation

July 8, 2024
Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu
cs.AI

Abstract

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation. To mitigate these limitations, we present Anole, an open, autoregressive, native large multimodal model for interleaved image-text generation. We build Anole from Meta AI's Chameleon, adopting an innovative fine-tuning strategy that is both data-efficient and parameter-efficient. Anole demonstrates high-quality, coherent multimodal generation capabilities. We have open-sourced our model, training framework, and instruction tuning data.
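
The abstract describes a fine-tuning strategy that is both data- and parameter-efficient but does not spell out the mechanism. The sketch below is a minimal, hypothetical PyTorch illustration of one such scheme for a Chameleon-style decoder that shares a single vocabulary across text tokens and discrete (VQ) image tokens: every weight is frozen except the output-head rows that score image tokens, so only a small fraction of parameters receives gradients. The toy decoder, vocabulary split, and all sizes are illustrative assumptions, not Anole's actual configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only -- not Anole/Chameleon's real configuration.
TEXT_VOCAB = 1000    # hypothetical number of text tokens
IMAGE_VOCAB = 512    # hypothetical number of discrete (VQ) image tokens
VOCAB = TEXT_VOCAB + IMAGE_VOCAB
D_MODEL = 64

class TinyDecoder(nn.Module):
    """Toy stand-in for an autoregressive decoder over a unified
    text + image-token vocabulary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, ids):
        # Causal mask keeps the toy model autoregressive.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.blocks(self.embed(ids), mask=mask)
        return self.lm_head(h)  # logits over text and image tokens

model = TinyDecoder()

# Freeze everything ...
for p in model.parameters():
    p.requires_grad = False
# ... then unfreeze only the output head, and use a gradient hook to zero
# out the text-token rows, so in effect only the image-token logits train.
model.lm_head.weight.requires_grad = True
image_rows = torch.zeros(VOCAB, 1)
image_rows[TEXT_VOCAB:] = 1.0  # 1 for rows that score image tokens
model.lm_head.weight.register_hook(lambda g: g * image_rows)

# One next-token-prediction step on a random interleaved sequence.
opt = torch.optim.AdamW([model.lm_head.weight], lr=1e-4)
ids = torch.randint(0, VOCAB, (2, 16))
logits = model(ids[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, VOCAB), ids[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(f"loss: {loss.item():.3f}")

Under this kind of scheme, decoding stays purely autoregressive: when the model emits image tokens, a VQ detokenizer (not shown) renders them to pixels, which is what allows interleaved image-text output without a separate diffusion model, as in Chameleon's design.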
