ANOLE: An Open, Autoregressive, Native Large Multimodal Model for Interleaved Image-Text Generation
July 8, 2024
Authors: Ethan Chern, Jiadi Su, Yan Ma, Pengfei Liu
cs.AI
Abstract
Previous open-source large multimodal models (LMMs) have faced several
limitations: (1) they often lack native integration, requiring adapters to
align visual representations with pre-trained large language models (LLMs); (2)
many are restricted to single-modal generation; (3) while some support
multimodal generation, they rely on separate diffusion models for visual
modeling and generation. To mitigate these limitations, we present Anole, an
open, autoregressive, native large multimodal model for interleaved image-text
generation. We build Anole from Meta AI's Chameleon, adopting an innovative
fine-tuning strategy that is both data-efficient and parameter-efficient. Anole
demonstrates high-quality, coherent multimodal generation capabilities. We have
open-sourced our model, training framework, and instruction tuning data.
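To make the "native, autoregressive" claim concrete: in a Chameleon-style early-fusion model, text tokens and discrete image tokens (produced by a VQ image tokenizer) share one vocabulary, and a single transformer predicts the next token regardless of modality, with no adapter or separate diffusion model. The sketch below illustrates that decoding loop under stated assumptions; all names here (`model`, `text_tokenizer`, `image_tokenizer`, the `BOI`/`EOI` token ids, and the greedy decoding) are hypothetical placeholders for illustration, not the released Anole API.

```python
# Minimal sketch of interleaved image-text decoding with one
# autoregressive transformer (Chameleon-style early fusion).
# All identifiers and token ids below are illustrative assumptions.

import torch

BOI, EOI = 8196, 8197  # hypothetical "begin/end of image" sentinel ids

@torch.no_grad()
def generate_interleaved(model, text_tokenizer, image_tokenizer,
                         prompt, max_new_tokens=2048):
    """Decode one mixed stream of text and image tokens, then split it.

    Every token -- text or image -- comes from the same next-token
    head; image spans are delimited by BOI/EOI and decoded back to
    pixels by the VQ image tokenizer.
    """
    ids = text_tokenizer.encode(prompt)
    outputs, image_buffer, in_image = [], [], False

    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]  # next-token logits
        next_id = int(torch.argmax(logits))         # greedy, for brevity
        ids.append(next_id)

        if next_id == BOI:          # entering an image span
            in_image, image_buffer = True, []
        elif next_id == EOI:        # span complete: decode VQ codes to pixels
            outputs.append(("image", image_tokenizer.decode(image_buffer)))
            in_image = False
        elif in_image:
            image_buffer.append(next_id)  # accumulate discrete image codes
        else:
            outputs.append(("text", text_tokenizer.decode([next_id])))

    return outputs
```

Because generation is a single token stream, fine-tuning for image output can be both data- and parameter-efficient in the way the abstract describes: the pre-trained decoder already models the sequence, so only the components tied to image-token prediction need substantial updating.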