Jointly Training Large Autoregressive Multimodal Models
September 27, 2023
Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
cs.AI
Abstract
In recent years, advances in the large-scale pretraining of language and
text-to-image models have revolutionized the field of machine learning. Yet,
integrating these two modalities into a single, robust model capable of
generating seamless multimodal outputs remains a significant challenge. To
address this gap, we present the Joint Autoregressive Mixture (JAM) framework,
a modular approach that systematically fuses existing text and image generation
models. We also introduce a specialized, data-efficient instruction-tuning
strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned
model demonstrates unparalleled performance in generating high-quality
multimodal outputs and represents the first model explicitly designed for this
purpose.
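
The abstract does not spell out how the fusion between the pretrained text and image models is carried out. Purely as an illustration of what a modular fusion of two same-architecture autoregressive decoders could look like, the sketch below blends their parameters into a single joint model; the function and model names are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the abstract does not specify JAM's fusion mechanism.
# It shows one simple way two autoregressive decoders with identical architectures
# could be merged (uniform parameter averaging) as a stand-in for "systematically
# fusing existing text and image generation models". All names are hypothetical.
import copy

import torch
import torch.nn as nn


def average_parameters(text_model: nn.Module,
                       image_model: nn.Module,
                       alpha: float = 0.5) -> nn.Module:
    """Return a copy of text_model whose weights blend both models."""
    fused = copy.deepcopy(text_model)
    text_state = text_model.state_dict()
    image_state = image_model.state_dict()
    fused_state = {
        name: alpha * text_state[name] + (1.0 - alpha) * image_state[name]
        for name in text_state
    }
    fused.load_state_dict(fused_state)
    return fused


if __name__ == "__main__":
    # Tiny stand-in decoders (same architecture) just to make the sketch runnable.
    torch.manual_seed(0)
    text_lm = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
    image_lm = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
    joint_lm = average_parameters(text_lm, image_lm, alpha=0.5)

    tokens = torch.randint(0, 100, (1, 8))
    print(joint_lm(tokens).shape)  # logits over a shared token vocabulary
```

In a real system the two decoders would operate over a shared mixed-modal token vocabulary (text tokens plus discrete image tokens), and the instruction-tuning stage mentioned in the abstract would be applied on top of the fused model.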