Jointly Training Large Autoregressive Multimodal Models
September 27, 2023
Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
cs.AI
Abstract
In recent years, advances in the large-scale pretraining of language and
text-to-image models have revolutionized the field of machine learning. Yet,
integrating these two modalities into a single, robust model capable of
generating seamless multimodal outputs remains a significant challenge. To
address this gap, we present the Joint Autoregressive Mixture (JAM) framework,
a modular approach that systematically fuses existing text and image generation
models. We also introduce a specialized, data-efficient instruction-tuning
strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned
model demonstrates unparalleled performance in generating high-quality
multimodal outputs and represents the first model explicitly designed for this
purpose.
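
The abstract does not spell out how the fusion between the pretrained text and image models is carried out. Purely as an illustration of what a modular fusion of two same-architecture autoregressive decoders could look like, the sketch below blends their parameters into a single joint model; the function and model names are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the abstract does not specify JAM's fusion mechanism.
# It shows one simple way two autoregressive decoders with identical architectures
# could be merged (uniform parameter averaging) as a stand-in for "systematically
# fusing existing text and image generation models". All names are hypothetical.
import copy

import torch
import torch.nn as nn


def average_parameters(text_model: nn.Module,
                       image_model: nn.Module,
                       alpha: float = 0.5) -> nn.Module:
    """Return a copy of text_model whose weights blend both models."""
    fused = copy.deepcopy(text_model)
    text_state = text_model.state_dict()
    image_state = image_model.state_dict()
    fused_state = {
        name: alpha * text_state[name] + (1.0 - alpha) * image_state[name]
        for name in text_state
    }
    fused.load_state_dict(fused_state)
    return fused


if __name__ == "__main__":
    # Tiny stand-in decoders (same architecture) just to make the sketch runnable.
    torch.manual_seed(0)
    text_lm = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
    image_lm = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
    joint_lm = average_parameters(text_lm, image_lm, alpha=0.5)

    tokens = torch.randint(0, 100, (1, 8))
    print(joint_lm(tokens).shape)  # logits over a shared token vocabulary
```

In a real system the two decoders would operate over a shared mixed-modal token vocabulary (text tokens plus discrete image tokens), and the instruction-tuning stage mentioned in the abstract would be applied on top of the fused model.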